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Abstract 

Principal component analysis (PCA) is a widely used technique for data analysis and 
dimension reduction with numerous applications in science and engineering. However, 
the standard PCA suffers from the fact that the principal components (PCs) are usually 
linear combinations of all the original variables, and it is thus often difficult to interpret 
the PCs. To alleviate this drawback, various sparse PCA approaches were proposed in 
literature [TSJ El CEZ1 E3 El ESJ CEE1 El US] • Despite success in achieving sparsity, some 
important properties enjoyed by the standard PCA are lost in these methods such as 
uncorrelation of PCs and orthogonality of loading vectors. Also, the total explained 
variance that they attempt to maximize can be too optimistic. In this paper we propose 
a new formulation for sparse PCA, aiming at finding sparse and nearly uncorrelated 
PCs with orthogonal loading vectors while explaining as much of the total variance as 
possible. We also develop a novel augmented Lagrangian method for solving a class of 
nonsmooth constrained optimization problems, which is well suited for our formulation 
of sparse PCA. We show that it converges to a feasible point, and moreover under some 
regularity assumptions, it converges to a stationary point. Additionally, we propose 
two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, 
and establish their global and local convergence. Finally, we compare our sparse PCA 
approach with several existing methods on synthetic, random, and real data, respectively. 
The computational results demonstrate that the sparse PCs produced by our approach 
substantially outperform those by other methods in terms of total explained variance, 
correlation of PCs, and orthogonality of loading vectors. 
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1 Introduction 



Principal component analysis (PCA) is a popular tool for data processing and dimension 
reduction. It has been widely used in numerous applications in science and engineering such 
as biology, chemistry, image processing, machine learning and so on. For example, PCA has 
recently been applied to human face recognition, handwritten zip code classification and gene 
expression data analysis (see [T0| [T2l [T| [TTj). 

In essence, PCA aims at finding a few linear combinations of the original variables, called 
principal components (PCs), which point in orthogonal directions capturing as much of the 
variance of the variables as possible. It is well known that PCs can be found via the eigenvalue 
decomposition of the covariance matrix E. However, E is typically unknown in practice. 
Instead, the PCs can be approximately computed via the singular value decomposition (SVD) 
of the data matrix or the eigenvalue decomposition of the sample covariance matrix. In detail, 
let £ = . . . , be a p-dimensional random vector, and X be an n x p data matrix, 
which records the n observations of £. Without loss of generality, assume X is centered, that 
is, the column means of X are all 0. Then the commonly used sample covariance matrix is 
E = X T X/{n — 1). Suppose the eigenvalue decomposition of E is 

t = VDV T . 

Then 77 = t^V gives the PCs, and the columns of V are the corresponding loading vectors. It 
is worth noting that V can also be obtained by performing the SVD of X (see, for example, 
[28]). Clearly, the columns of V are orthonormal vectors, and moreover V T I1V is diagonal. 
We thus immediately see that if E = E, the corresponding PCs are uncorrelated; otherwise, 
they can be correlated with each other (see Section [2] for details). We now describe several 
important properties of the PCs obtained by the standard PCA when E is well estimated by 
E (see also [28]): 

1. The PCs sequentially capture the maximum variance of the variables approximately, 
thus encouraging minimal information loss as much as possible; 

2. The PCs are nearly uncorrelated, so the explained variance by different PCs has small 
overlap; 

3. The PCs point in orthogonal directions, that is, their loading vectors are orthogonal to 
each other. 

In practice, typically the first few PCs are enough to represent the data, thus a great dimen- 
sionality reduction is achieved. In spite of the popularity and success of PCA due to these 
nice features, PCA has an obvious drawback, that is, PCs are usually linear combinations of 
all p variables and the loadings are typically nonzero. This makes it often difficult to interpret 
the PCs, especially when p is large. Indeed, in many applications, the original variables have 
concrete physical meaning. For example in biology, each variable might represent the expres- 
sion level of a gene. In these cases, the interpretation of PCs would be facilitated if they were 
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composed only from a small number of the original variables, namely, each PC involved a 
small number of nonzero loadings. It is thus imperative to develop sparse PCA techniques for 
finding the PCs with sparse loadings while enjoying the above three nice properties as much 
as possible. 

Sparse PCA has been an active research topic for more than a decade. The first class 
of approaches are based on ad-hoc methods by post-processing the PCs obtained from the 
standard PCA mentioned above. For example, Jolliffe [15J applied various rotation techniques 
to the standard PCs for obtaining sparse loading vectors. Cadima and Jolliffe [6J proposed a 
simple thresholding approach by artificially setting to zero the standard PCs' loadings with 
absolute values smaller than a threshold. In recent years, optimization approaches have been 
proposed for finding sparse PCs. They usually formulate sparse PCA into an optimization 
problem, aiming at achieving the sparsity of loadings while maximizing the explained variance 
as much as possible. For instance, Jolliffe et al. [17] proposed an interesting algorithm, called 
SCoTLASS, for finding sparse orthogonal loading vectors by sequentially maximizing the 
approximate variance explained by each PC under the /x-norm penalty on loading vectors. 
Zou et al. [28] formulated sparse PCA as a regression-type optimization problem and imposed 
a combination of l\- and Z 2 -norm penalties on the regression coefficients. d'Aspremont et al. [8] 
proposed a method, called DSPCA, for finding sparse PCs by solving a sequence of semidefinite 
program relaxations of sparse PCA. Shen and Huang [25j recently developed an approach for 
computing sparse PCs by solving a sequence of rank-one matrix approximation problems 
under several sparsity-inducing penalties. Very recently, Journee et al. [32] formulated sparse 
PCA as nonconcave maximization problems with l - or ii-norm sparsity-inducing penalties. 
They showed that these problems can be reduced into maximization of a convex function 
on a compact set, and they also proposed a simple but computationally efficient gradient 
method for finding a stationary point of the latter problems. Additionally, greedy methods 
were investigated for sparse PCA by Moghaddam et al. [IB] and d'Aspremont et al. [7J. 

The PCs obtained by the above methods [151 El HZl I2SI El 1251 UHl El US] are usually sparse. 
However, the aforementioned nice properties of the standard PCs are lost to some extent in 
these sparse PCs. Indeed, the likely correlation among the sparse PCs are not considered in 
these methods. Therefore, their sparse PCs can be quite correlated with each other. Also, 
the total explained variance that these methods attempt to maximize can be too optimistic as 
there may be some overlap among the individual variances of sparse PCs. Finally, the loading 
vectors of the sparse PCs given by these methods lack orthogonality except SCoTLASS [17] . 

In this paper we propose a new formulation for sparse PCA by taking into account the 
three nice properties of the standard PCA, that is, maximal total explained variance, un- 
correlation of PCs, and orthogonality of loading vectors. We also explore the connection of 
this formulation with the standard PCA and show that it can be viewed as a certain per- 
turbation of the standard PCA. We further propose a novel augmented Lagrangian method 
for solving a class of nonsmooth constrained optimization problems, which is well suited for 
our formulation of sparse PCA. This method differs from the classical augmented Lagrangian 
method in that: i) the values of the augmented Lagrangian functions at their approximate 
minimizers given by the method are bounded from above; and ii) the magnitude of penalty 
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parameters outgrows that of Lagrangian multipliers (see Section [3721 for details). We show 
that this method converges to a feasible point, and moreover it converges to a first-order sta- 
tionary point under some regularity assumptions. We also propose two nonmonotone gradient 
methods for minimizing a class of nonsmooth functions over a closed convex set, which can 
be suitably applied to the subproblems arising in our augmented Lagrangian method. We 
further establish global convergence and, under a local Lipschitzian error bounds assumption 
[26] . local linear rate of convergence for these gradient methods. Finally, we compare the 
sparse PC A approach proposed in this paper with several existing methods [28, [HJ, [25], [16] on 
synthetic, random, and real data, respectively. The computational results demonstrate that 
the sparse PCs obtained by our approach substantially outperform those by the other methods 
in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors. 

The rest of paper is organized as follows. In Section [2] we propose a new formulation 
for sparse PCA and explore the connection of this formulation with the standard PCA. In 
Section [3] we then develop a novel augmented Lagrangian method for a class of nonsmooth 
constrained problems, and propose two nonmonotone gradient methods for minimizing a class 
of nonsmooth functions over a closed convex set. In Section H] we discuss the applicability 
and implementation details of our augmented Lagrangian method for sparse PCA. The sparse 
PCA approach proposed in this paper is then compared with several existing methods on 
synthetic, random, and real data in Section [5] Finally, we present some concluding remarks 
in Section [6] 

1.1 Notation 

In this paper, all vector spaces are assumed to be finite dimensional. The symbols 3ft™ and 3ft™ 
(resp., denote the n-dimensional Euclidean space and the nonnegative (resp., nonpositive) 
orthant of 3ft™, respectively, and 3ft ++ denotes the set of positive real numbers. The space of 
all m x n matrices with real entries is denoted by 3ft™**™. The space of symmetric n x n 
matrices is denoted by S n . Additionally, T>™ denotes the space of n x n diagonal matrices. 
For a real matrix X, we denote by |X| the absolute value of X, that is, \X\ij = \Xy\ for all 
ij, and by sign(X) the sign of X whose ijth entry equals the sign of X^ for all ij. Also, the 
nonnegative part of X is denoted by [X] + whose ijth entry is given by max{0, X^} for all ij. 
The rank of X is denoted by rank(X). Further, the identity matrix and the all-ones matrix 
are denoted by I and E, respectively, whose dimension should be clear from the context. If 
X G <S n is positive semidefinite, we write X y 0. For any X, Y G S n , we write X ^ Y to 
mean Y — X y 0. Given matrices X and Y in 3ft mx ™, the standard inner product is defined 
by X • Y := Ty(XY t ), where Tr(-) denotes the trace of a matrix, and the component-wise 
product is denoted by X © Y, whose ijth entry is XyYij for all ij. \\ ■ || denotes the Euclidean 
norm and its associated operator norm unless it is explicitly stated otherwise. The minimal 
(resp., maximal) eigenvalue of an n x n symmetric matrix X are denoted by A m i n (A) (resp., 
Amax(A)), respectively, and \{X) denotes its ith largest eigenvalue for i = 1, . . . ,n. Given 
a vector v G 3ft n , Diag(t>) or Diag(t> 1; . . . ,v n ) denotes a diagonal matrix whose ith diagonal 
element is Vi for i = 1, . . . ,n. Given an n x n matrix X, Diag(X) denotes a diagonal matrix 
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whose ith diagonal element is Xa for % = 1, . . . ,n. Let U be a real vector space. Given a 
closed convex set CCW, let dist(-, C) :U — > denote the distance function to C measured 
in terms of || ■ ||, that is, 

dist(u, C) := inf \\u - u\\ VueU. (1) 

2 Formulation for sparse PCA 

In this section we propose a new formulation for sparse PCA by taking into account sparsity 
and orthogonality of loading vectors, and uncorrelation of PCs. We also address the connection 
of our formulation with the standard PCA. 

Let £ = (£ < - 1 \ . . . , be a p-dimensional random vector with covariance matrix £. Sup- 
pose X is an n x p data matrix, which records the n observations of £. Without loss of general- 
ity, assume the column means of X are 0. Then the commonly used sample covariance matrix 
of £ is £ = X T X/{n — 1). For any r loading vectors represented as V — [Vx, . . . , V r ] G 9ft pxr 
where 1 < r < p, the corresponding components are given by r) = (r]^\ . . . , 77^) = £V, which 
are linear combinations of ^\ . . . , Clearly, the covariance matrix of 77 is V T T,V, and thus 
the components rj® and r]V> are uncorrelated if and only if the ijth entry of V T T,V is zero. 
Also, the total explained variance by the components rj^'s equals, if they are uncorrelated, 
the sum of the individual variances of r/^'s, that is, 

r 

J2V t T ZVi = Tr(V T ZV). 

i=l 

Recall that our aim is to find a set of sparse and orthogonal loading vectors V so that the 
corresponding components rj^\ . . . , r/^ are uncorrelated and explain as much variance of the 
original variables . . . , as possible. It appears that our goal can be achieved by solving 
the following problem: 

max Tr(V T £VO - p • \V\ 

s.t. V T YV is diagonal, (2) 
V T V = J, 

where p G 3ft pxr is a tunning parameter for controlling the sparsity of V. However, the covari- 
ance matrix £ is typically unknown and can only be approximated by the sample covariance 
matrix £. It looks plausible to modify (j2J) by simply replacing £ with £ at a glance. Never- 
theless, such a modification would eliminate all optimal solutions V* of (j2J) from consideration 
since {V*) T YiV* is generally non-diagonal. For this reason, given a sample covariance £, we 
consider the following formulation for sparse PCA, which can be viewed as a modification of 
problem (j5J), 

max Ti(V T tV) -p • \V\ 

s.t. |Vf£V}|<A^ Vi^j, (3) 
V T V = L 
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where Ay > (z 7^ j) are the parameters for controlling the correlation of the components 
corresponding to V. Clearly, Ay = A^j for all i ^ j. 

We next explore the connection of formulation ([3]) with the standard PCA. Before pro- 
ceeding, we state a technical lemma as follows that will be used subsequently. Its proof can 
be found in [20J. 

Lemma 2.1 Given any £ G <S n and integer 1 < r < n, define 

i r = max{l < i < n : Aj(£) > A r (£)}, z r = max{l < i < n : A,(£) = A r (£)}, (4) 

and /et /* 6e i/ie optimal value of 

max{Tr(SF) : O^Y ^1, Tr(Y) = r}. (5) 

Then, f* = $^i=i ^t(^); owd ^* an optimal solution of $\) if and only if Y* = U*Ui T + 
U;P*Uf, where P* G <S^ safc/^es ^ P* ^ I and Tr(P*) = r - i r , and U{ G K nx ^ and 
C/| G 9fj mx< * r_ ^) are ^/i e matrices whose columns consist of the orthonormal eigenvectors o/£ 
corresponding to the eigenvalues (Ai(£), . . . , Aj (£)) and (Aj +i(£), . . . , Aj r (£)), respectively. 

We next address the relation between the eigenvectors of £ and the solutions of problem 
([3]) when p = and Ay = for all % ^ j. 

Proposition 2.2 Suppose for problem (dp that p = and Ay = /or a££ % j. Let f* be the 

optimal value of (TJ|). Then, f* = Y7i=i ^i(^); anc ^ ^* e 3^™ xr is an optimal solution of (TJ|) if 
and only if the columns of V* consist of the orthonormal eigenvectors of £ corresponding to 
r largest eigenvalues o/E. 

Proof. We first show that /* = Y^i=i ^i(^)- Indeed, let [/ be an n x r matrix whose 
columns consist of the orthonormal eigenvectors of E corresponding to r largest eigenvalues 
of E. We then see that U is a feasible solution of ([3]) and Tr([/ T E?7) = Y7i=i ^i(^)- It follows 
that /* > Xli=i ^t'(^)- On t ne other hand, we observe that /* is bounded above by the optimal 
value of 

m&x{Tr(V T £V) : V T V = I, V G 5T xr }. 

We know from [5] that its optimal value equals YH=i \ C^)- Therefore, /* = Y7i=i \(^) holds 
and U is an optimal solution of It also implies that the "if" part of this proposition holds. 
We next show that the "only if" part also holds. Let V* G dt nxr be an optimal solution of ([3]), 
and define Y* = V*V* T . Then, we have V* T V* = I, which yields ^ Y* ^ I and Tr(F*) = r. 
Hence, Y* is a feasible solution of (jSJ). Using the fact that /* = Yli=i we then have 

r 

Tr(EF*) = Tr(V r * T EV*) = J^A^E), 

8=1 

which together with Lemma [2.11 implies that Y* is an optimal solution of (131). Let 2 r and i r 
be defined in (gj. Then, it follows from Lemma O that Y* = UfUf + U%P*Uf, where 
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P* e s^-ir satisfies ^ P* < I and Tr(P* 



and U{ G K nx ^ and t/| G K nx (^-^) 



are the matrices whose columns consist of the orthonormal eigenvectors of E corresponding 



to the eigenvalues (Ai(E), . . . , A j (E)) and (A 



Ai r (E)), respectively. Thus, we have 



tu* = u*a, tu; = x r {t)u 



2 ■ 



(6) 



where A = Diag(Ai(E), . . . , Aj (E)). In addition, it is easy to show that rank(F*) = i r + 
rank(P*). Since Y* = V*V* T and V* T V* = I, we can observe that rank(F*) = r. Hence, 
rank(P*) = r — i r , which implies that P* has only r — i r nonzero eigenvalues. Using this 
fact and the relations ^ P* -< I and Tr(P*) = r — i r , we can further conclude that r — i r 
eigenvalues of P* are 1 and the rest are 0. Therefore, there exists W G 3fj(* r ~^) x ( r -*r) such 
that 

W T W = I, P* = WW T . (7) 

It together with Y* = U^Uf + U^P*Uf implies that Y* = U*U* T , where U* = [E/J U£W]. In 
view of © and the identities UfU{ = J, UfU% = I and UfU* 2 = 0, we see that U* T U* = I. 
Using this result, and the relations V* T V* = I and Y* = U*U* T = V*V* T , it is not hard 
to see that the columns of U* and V* form an orthonormal basis for the range space of Y*, 
respectively. Thus, V* = U*Q for some Q G W xr satisfying Q T Q = I. Now, let D = V* T tV*. 
By the definition of V*, we know that D is an r x r diagonal matrix. Moreover, in view of ([6]), 
©, the definition of U*, and the relations V* = U*Q, UfU{ = I, UfU% = I and UfU* 2 = 0, 
we have 



D 



v* T zv* = q t u* t j:u*q = q 



Ttt*T^tt*, 



Q 1 



w T uf 



A 
A r (E)J 



Q. 



t[u{ u;w]q 



which together with Q T Q = I implies that D is similar to the diagonal matrix appearing on 
the right-hand side of (jSj). Hence, the diagonal elements of D consist of r largest eigenvalues 
of E. In addition, let Q\ G ^R lrXr and Q 2 G 9£( r ~W xr be the submatrices corresponding to the 
first i r and the last r — i r rows of Q, respectively. Then, in view of the definition of U* and 
V* = U*Q, we have 



[u* u;w] = u* = v*Q q 



[V*Ql 



V*Q T 2 ] 



Thus, we obtain that U{ = V*Qj and U£W = V*Q\. Using these identities, 
relation V* = U*Q, we have 



(181), and the 



EU* = EU*Q 



e[u* u;w\Q = [U*A k(t)u;w]q 

Ti^ _ „*„T \ A 

A r (E)/ 



[v*Qi a A r (E)u*g^]g = v*q 



Q 



V*D. 



It follows that the columns of V* consist of the orthonormal eigenvectors of E corresponding 
to r largest eigenvalues of E, and thus the "only if" part of this proposition holds. ■ 
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From the above proposition, we see that when p = and Ay = for all i ^ j, each solution 
of ([3]) consists of the orthonormal eigenvectors of E corresponding to r largest eigenvalues of 
E, which can be computed from the eigenvalue decomposition of E. Therefore, the loading 
vectors obtained from ([3]) are the same as those given by the standard PCA when applied to 
E. On the other hand, when p and A^ for all i ^ j are small, the loading vectors found by 
([3]) can be viewed as an approximation to the ones provided by the standard PCA. We will 
propose suitable methods for solving ([3]) in Sections [3] and HI 

3 Augmented Lagrangian method for nonsmooth con- 
strained nonlinear programming 

In this section we propose a novel augmented Lagrangian method for a class of nonsmooth 
constrained nonlinear programming problems, which is well suited for formulation ([3]) of sparse 
PCA. In particular, we study first-order optimality conditions in Subsection 13. II In Subsection 
13.21 we develop an augmented Lagrangian method and establish its global convergence. In 
Subsection 13. 3[ we propose two nonmonotone gradient methods for minimizing a class of non- 
smooth functions over a closed convex set, which can be suitably applied to the subproblems 
arising in our augmented Lagrangian method. We also establish global and local convergence 
for these gradient methods. 

3.1 First-order optimality conditions 

In this subsection we introduce a class of nonsmooth constrained nonlinear programming 
problems and study first-order optimality conditions for them. 
Consider the nonlinear programming problem 

min f{x) + P{x) 

s.t. gi(x)<0, i = l,...,m, , g . 
hi(x) = 0, % = 1, . . . ,p, 
x G X. 

We assume that the functions / : 3ft n — > 3ft, <fc : 3ft n — > 3ft, i = 1, . . . , m, and hi : 3ft n — > 3ft, 
i = 1, . . . ,p, are continuously differentiable, and that the function P : 3ft n — > 3ft is convex but 
not necessarily smooth, and that the set X C 3ft n is closed and convex. For convenience of the 
subsequent presentation, we denote by Q the feasible region of problem Qj. 

Before establishing first-order optimality conditions for problem (Q, we describe a general 
constraint qualification condition for (Q, that is, Robinson's condition that was proposed in 

Let x G 3ft™ be a feasible point of problem We denote the set of active inequality 
constraints at 

A(x) = {1 < i < m : gi(x) = 0}. 
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In addition, x is said to satisfy Robinson's condition if 



g'{x)d — v 
h'(x)d 



: d G T x (x), v g , Vi < 0, % e A{x) \ = 3? m x W 



(10) 



where g\x) and h'(x) denote the Jacobian of the functions g = (gi, . . . , g m ) and h = (hi, ... , h p ) 
at x, respectively. Other equivalent expressions of Robinson's condition can be found, for ex- 
ample, in [2HE21E1]. 

The following proposition demonstrates that Robinson's condition is indeed a constraint 
qualification condition for problem (Q. For the sake of completeness, we include a brief proof 
for it. 



Proposition 3.1 Given a feasible point x G dt n of problem (TJ|) ; let Tq(x) be the tangent cone 
to fl at x, and (Tq(x))° be its polar cone. If Robinson's condition [W\) holds at x, then 

Tn{x) ~ \ deT ^ x >- <FVh i (x) = 0, i = l,...,pf> 
(T Q (x))° = I J2 ^g l (x) + Y,^h l (x) + N x (x):\e$t™, fiew\, (11) 

[ie^(x) i=i J 

where Tx{x) and Nx{x) are the tangent and normal cones to X at x, respectively. 

Proof. By Theorem A. 10 of [21], we see that Robinson's condition fflQ|) implies that the 
assumption of Theorem 3.15 of [24J is satisfied with 

x = x, X = X, Y = K m xW, g(-) = (g 1 (-y,...;g m (-y,h 1 (-y,...;h p (-)). 

The first statement then follows from Theorem 3.15 of [24J with the above Xq, X , Yq and g(-). 
Further, let A(x) denote the matrix whose rows are the gradients of all active constraints at 
x in the same order as they appear in (Q. Then, Robinson's condition (TlQl) implies that the 
assumptions of Theorem 2.36 of [21] are satisfied with 

A = A(x), K x = T x (x), K 2 = $ A{x)l x W. 

Let K = {d G Ki : Ad G K 2 }. Then, it follows from Theorem 2.36 of [24J that 

(T n (x))° = K° = K{ + {A T i : £ G K°}, 

which together with the identity (Tx(x))° = Nx{x) and the definitions of A, K\ and K 2 , 
implies that the second statement holds. ■ 

We are now ready to establish first-order optimality conditions for problem ([9]). 
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Theorem 3.2 Let x* G 9ft ra be a local minimizer of problem Assume that Robinson's 
condition OOj) is satisfied at x* . Then there exist Lagrange multipliers A G 9ft+ and \i G W 
such that 

m p 

G Vf(x*) + dP(x*) + ^ XiV gi {x*) + J2^ i Vh i (x*) + iV x (x*), (12) 

i=l i=\ 

and 

\ igi (x*) = 0, i = l,...,m. (13) 

Moreover, the set of Lagrange multipliers (A, /i) G 9ft™ x 9ft p satisfying the above conditions, 
denoted by A(x*), is convex and compact. 

Proof. We first show that 

d T Vf(x*) + P'(x*; d)>0 \/d G T Q (x*). (14) 

Let d G Tq(x*) be arbitrarily chosen. Then, there exist sequences {x k }^ =l C f2 and {t^}^ C 
3ft ++ such that j and 

d = lim 



Thus, we have x k = x* + t^d + o(tfc). Using this relation along with the fact that the function 
/ is differentiable and P is convex in 3ft™, we can have 

f(x* + t k d) - f(x k ) = o(t k ), P(x* + t k d) - P(x k ) = o(t k ), (15) 

where the first equality follows from the Mean Value Theorem while the second one comes 
from Theorem 10.4 of [23]. Clearly, x k — > x*. This together with the assumption that x* is a 
local minimizer of implies that 

f{x k ) + P(x k ) > f(x*) + P(x*) (16) 

when k is sufficiently large. In view of (fT5|) and f[T6"j) . we obtain that 

#V /(*') + H = lim gMlg + lim ^* + M) " W 



fe^oo t^, k^oo t k 

f(x k ) + P(x k )-f(x*)-P(x*) , o(f fc ) 



lim 



Hm /(x fc ) + P(x fc )-/(x*)-P(x*) > Q 

and hence (fill) holds. 

For simplicity of notations, let = (Tq(s*))° and 5 = —Vf(x*) — dP(x*). We next show 
that 5 fl Tq 7^ 0. Suppose for contradiction that S fl = 0. This together with the fact that 
5 and Tq are nonempty closed convex sets and S is bounded, implies that there exists some 
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d G 3? ra such that d T y < for any y G Tq, and d T y > 1 for any y G S. Clearly, we see that 
dG(T^)°=T n (x*),and 

l<Md T y= inf d T (-V/(x*) -2) = -d T Vf(x*)- sup d T z = -d T Vf(x*)-P'(x*;d) } 

yeS z€dP(x*) z£dP(x*) 

which contradicts (THjh Hence, we have S 1 fl 7^ 0. Using this relation, ffTTj) . the definitions 
of 5" and .4(£*), and letting Aj = for ? ^ A(x*), we easily see that (fl2l) and (fl3l) hold. 

In view of the fact that dP(x*) and N x (x*) are closed and convex, and moreover dP(x*) 
is bounded, we know that dP(x*) + N x (x*) is closed and convex. Using this result, it 
is straightforward to see that A (2*) is closed and convex. We next show that A (a;*) is 
bounded. Suppose for contradiction that A (2*) is unbounded. Then, there exists a sequence 
{(X k , H k )}^ =1 C AO*) such that ||(A fe ,/?)|| -> 00, and 

m p 

= V/(x*) + z k + ^9i{x*) + rf^W**) + vk ( 17 ) 

i=l i=l 

for some {z k }^ =1 C dP(x*) and {v*}^ C iV x (i*). Let (A fc ,/x fe ) = (A fc , //)/||_(A fc , //) || . 

By passing to a subsequence if necessary, we can assume that (X k , Ji k ) — > (A, /x). We clearly 
see that ||(A,/2)|| = 1, A G 3ft+, and A* = for i A(x*). Note that dP(x*) is bounded and 
N x (x*) is a closed cone. In view of this fact, and upon dividing both sides of ( lP7j) by || (A fc , /x fc ) || 
and taking limits on a subsequence if necessary, we obtain that 

m p 

= ^Wg i (x*) + ^jl i Vh i (x*)+v (18) 

i=l i=l 

for some t> G Nx{x*). Since Robinson's condition fflUj) is satisfied at x*, there exist d G Tx(x*) 
and v G 5i m such that < for 2 G -4(a;*), and 

rf T V^(x*)-Wi = -Aj V? G ^4(x*), 

C? T V/lj(x*) = — yUj, i = l,...,J9. 

Using these relations, ([18]) and the fact that d G T x (x*), v G N x (x*), A G 3?+, and Aj = for 
z ^ -4.0*), we have 

m _ p m _ p 

E\ 2 + E^ 2 < -E^ T V^(x*)-EM T V^(a;*), 

i=l i=l i=l i=l 

(m _ p \ 

EA,Vft(f) + E^M = ^ < 0. 
i=l i=l / 

It yields (A, p) = (0,0), which contradicts the identity ||(A, p,)\\ = 1. Thus, A(x*) is bounded. 
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3.2 Augmented Lagrangian method for ([9]) 

For a convex program, it is known that under some mild assumptions, any accumulation 
point of the sequence generated by the classical augmented Lagrangian method is an optimal 
solution (e.g., see []). Nevertheless, when problem (EJ) is a nonconvex program, especially when 
the function hi is not affine or <?j is nonconvex, the classical augmented Lagrangian method 
may not even converge to a feasible point. To alleviate this drawback, we propose a novel 
augmented Lagrangian method for problem (Q and establish its global convergence in this 
subsection. 

Throughout this subsection, we make the following assumption for problem (J9j). 

Assumption 1 Problem |PP is feasible, and moreover at least a feasible solution, denoted by 
x , is known. 

It is well-known that for problem (Q the associated augmented Lagrangian function 
L e (x, A,|i):rxrxy^S is given by 

L g (x,\,fj,) :=w{x) +P(x), (19) 

where 

w{x) := f{x) + i(||[A + ^(*)] + || 2 - ||A|| 2 ) + fh{x) + f ||M*)H 2 , (20) 

and q > is a penalty parameter (e.g., see [21 Ell)- Roughly speaking, an augmented La- 
grangian method, when applied to problem (Q, solves a sequence of subproblems in the form 
of 

minL^a;, A, /i) 

while updating the Lagrangian multipliers (A, n) and the penalty parameter g. 

Let x feas be a known feasible point of (Q (see Assumption [1]). We now describe the 
algorithm framework of a novel augmented Lagrangian method as follows. 

Algorithm framework of augmented Lagrangian method: 

Let {e/c} be a positive decreasing sequence. Let A G 9?™, /i° G 3? p , £>o > 0, r > 0, a > 1 be 
given. Choose an arbitrary initial point x° nit G X and constant T > max{/(x feas ), L go (x® nit , A , /i' 
Set k = 0. 

1) Find an approximate solution x k G X for the subproblem 

mmL gk (x,X k ,fi k ) (21) 

such that 

dist (-Vw(x k ),dP{x k ) + N x {x k )) < e k , L Sk {x k ,X k ,fi k ) < T. (22) 
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2) Update Lagrange multipliers according to 



A fc +! := [\ k + g k g(x k )] + , := / + Qk h{x k ). (23) 

3) Set g k+1 := max {<jg k , ||A fc+1 || 1+T , \\fi k+1 \\ 1+T } . 

4) Set k <— k + 1 and go to step 1). 
end 

The above augmented Lagrangian method differs from the classical augmented Lagrangian 
method in that: i) the values of the augmented Lagrangian functions at their approximate 
minimizers given by the method are bounded from above (see Step 1)); and ii) the magnitude 
of penalty parameters outgrows that of Lagrangian multipliers (see Step 2)). 

In addition, to make the above augmented Lagrangian method complete, we need address 
how to find an approximate solution x k G X for subproblem ( !2~TT) satisfying (122]) as required 
in Step 1). We will leave this discussion to the end of this subsection. For the time being, we 
establish the main convergence result regarding this method for solving problem ([9]). 

Theorem 3.3 Assume that e k — > 0. Let {x k } be the sequence generated by the above aug- 
mented Lagrangian method satisfying (2E) . Suppose that a subsequence {x k }keK converges to 
x* . Then, the following statements hold: 

(a) x* is a feasible point of problem (G|)/ 

(b) Further, if Robinson's condition ( T7Qj) is satisfied atx* , then the subsequence {(A , /i +1 )}fc e x 
is bounded, and each accumulation point (A*,//*) of {(A fc+1 , /i fc+1 )}fc e x is the vector of 
Lagrange multipliers satisfying the first-order optimality conditions (Q1|)-(Q2|) at x* . 

Proof. In view of ( fl9l) . ( |20l) and the second relation in (|22"I) . we have 
f(x k ) + P{x k ) + ^-(||[A fc + Qk g{x k )} + \\ 2 - \\X k f) + ^ k fh(x k ) + f\\h(x k )\\ 2 < T V*. 
It follows that 

\\[X k /Qk + 9(x k )} + \\ 2 + \\h(x k )\\ 2 < 2[T - f{x k ) - g{x k ) - (» k ) T h(x k ))/ g k + (||A fc ||/^) 2 . 

Noticing that g > t > 0, and g k+1 = max {ag k , ||A fc+1 || 1+r , ||/U fc+1 || 1+r } for k > 0, we 
can observe that g k — > oo and \\(\ k , n k )\\/ g k — > 0. We also know that {x k } k£ K 
{g(x k )} k€ K g(x*) and {h(x k )} ke K h(x*). Using these results, and upon taking limits as 
k G K — > oo on both sides of the above inequality, we obtain that 

||[^(x*)] + || 2 +||Mx*)|| 2 <0, 

which implies that g(x*) < and h(x*) = 0. We also know that x* G X. It thus follows that 
statement (a) holds. 
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We next show that statement (b) also holds. Using (J2U), (USD, flU, flUD, and the first 
relation in (122|) . we have 

|| V/(x fc ) + (A fc+1 ) T V(7(x fe ) + {^ k+1 ) T Vh(x k ) + z k + v k \\ < e k (24) 

for some z k G dP(x k ) and t> fc G N x (x k ). Suppose for contradiction that the subsequence 
{(A fc+1 , fi k+1 )}keK is unbounded. By passing to a subsequence if necessary, we can assume 
that {(X k+1 ,n k+1 )} keK -> oo. Let (A fc+1 ,/i fc+1 ) = (A fe+1 , Ai* +1 )/||(A fc+1 , /x fe+1 )|| and v k = 
v k /\\(X h+1 ,fj l k+1 )\\. Recall that {x k } keK -> s*. It together with Theorem 6.2.7 of [13] im- 
plies that Vji e&K dP{x k ) is bounded, and so is {z k }k^K- hi addition, {g(x k )}k^K g( x *) and 
{h(x k )}k£K h(x*). Then, we can observe from (J241) that {■y fc }/ ce ^ is bounded. Without loss 
of generality, assume that {(A fc+1 , p k+1 )}k^K - > (A, ft) and {v k }k<=K v (otherwise, one can 
consider their convergent subsequences). Clearly, ||(A, p)\\ = 1. Dividing both sides of ( 1241) by 
||(A fe+1 ,/i fc+1 )|| and taking limits as k G k — > oo, we obtain that 

X T Vg(x*) + p T Vh(x*) + v = 0. (25) 

Further, using the identity A fc+1 = [\ k + Qkg(x k )} + and the fact that Qt — > oo and ||A fe ||/^fe — > 0, 
we observe that A fc+1 G 9?+ and A^ +1 = for i ^ ^4(x*) when G if is sufficiently large, which 
imply that A G 9ft+ and Aj = for i ^ A(x*). Moreover, we have v G Nx{x*) since Nx(x*) is a 
closed cone. Using these results, (1251) . Robinson's condition (flQj) at x*, and a similar argument 
as that in the proof of Theorem 13.21 we can obtain that (A, p) = (0, 0), which contradicts the 
identity ||(A, p,)\\ = 1. Therefore, the subsequence {(\ k+1 , ^ k+1 )}keK is bounded. Using this 
result together with (T241) and the fact {z h }keK is bounded, we immediately see that {v h }keK is 
bounded. Using semicontinuity of dP(-) and Nx(-) (see Theorem 24.4 of [23] and Lemma 2.42 
of [21]), and the fact {x k }k^K x*, we conclude that every accumulation point of {z k } keK 
and {v k }k£K belongs to dP(x*) and Nx(x*), respectively. Using these results and ( 1241) . we 
further see that for every accumulation point (A*,//*) of {(A fc+1 , fi k+1 )}keK, there exists some 
z* G dP(x*) and G N x (x*) such that 

V/(x*) + (X*) T Vg(x*) + (n*) T Vh(x*) + z *+v* = 0. 

Moreover, using the identity A fe+1 = [A fc + Qkg(x h )} + and the fact that Qk — > oo and ||A fc ||/^fc — > 
0, we easily see that A* G 3?+ and A* = for % A(x*). Thus, (A*,//*) satisfies the first-order 
optimality conditions (fI21) - (fT3l) at x*. ■ 

Before ending this subsection, we now briefly discuss how to find an approximate solution 
x k G X for subproblem (1211) satisfying ( 1221) as required in Step 1) of the above augmented 
Lagrangian method. In particular, we are interested in applying the nonmonotone gradient 
methods proposed in Subsection 13.31 to (T2Tj) . As shown in Subsection 13.31 (see Theorems 13.91 
and I3.13p . these methods are able to find an approximate solution x k G X satisfying the 
first relation of (1221) . Moreover, if an initial point for these methods is properly chosen, the 
obtained approximate solution x k also satisfies the second relation of (1221) . For example, given 
k > 0, let x k nit G X denote the initial point for solving the fcth subproblem ([2Tp . and we define 
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x k nit for k > 1 as follows 

x k = (x { ™ s , iiL ek {x k -\\ k ^ k )>T- 
mit | x k ~ 1 , otherwise, 

where x k ~ x is the approximate solution to the (k — l)th subproblem (l2l|) satisfying (|22|) (with 
k replaced by k — 1). Recall from Assumption [1] that x feas is a feasible solution of (Q. Thus, 
g(x teas ) < 0, and h(x feas ) = 0, which together with (fl9j) . ( J20l) and the definition of T implies 
that 

L, fc (x feas ,A fe ,//) < /(x feas ) < T. 

It follows from this inequality and the above choice of x k nit that L gk (x k nit , X k , //') < T. Ad- 
ditionally, the nonmonotone gradient methods proposed in Subsection 13.31 possess a natural 
property that the objective function values at all subsequent iterates are bounded below by 
the one at the initial point. Therefore, we have 

L gk (x k ,X k ,fi k ) < L ek (x k nit ,X k ,fi k ) < T, 

and so the second relation of (1221) is satisfied at x k . 



3.3 Nonmonotone gradient methods for nonsmooth minimization 

In this subsection we propose two nonmonotone gradient methods for minimizing a class of 
nonsmooth functions over a closed convex set, which can be suitably applied to the sub- 
problems arising in our augmented Lagrangian method detailed in Subsection 13.21 We also 
establish global convergence and local linear rate of convergence for these methods. It shall 
be mentioned that these two methods are closely related to the ones proposed in [27j and [26], 
but they are not the same (see the remarks below for details). In addition, these methods 
can be viewed as an extension of the well-known projected gradient methods studied in [I] for 
smooth problems, but the methods proposed in [27] and [26J cannot. 
Throughout this subsection, we consider the following problem 

min{F(x) := f(x) + P(x)}, (26) 

where / : 3? n — > 3? is continuously differentiable, P : 3? ra — > 3? is convex but not necessarily 
smooth, and X C 5J n is closed and convex. 

We say that x G 5i n is a stationary point of problem (1261) if x G X and 

G V/(ar) + dP(x) + N x (x). (27) 

Given a point x G K n and H y 0, we denote by dn(x) the solution of the following 
problem: 

d H (x) := argmin j Vf(x) T d+ ^d T Hd + P(x + d) : x + d G x\ . (28) 

The following lemma provides an alternative characterization of stationarity that will be 
used in our subsequent analysis. 
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Lemma 3.4 For any H y 0, x G X is a stationary point of problem l[26)) if and only if 
oIh{x) = 0. 

Proof. We first observe that (1281) is a convex problem, and moreover its objective function 
is strictly convex. The conclusion of this lemma immediately follows from this observation 
and the first-order optimality condition of ( 1281) . ■ 

The next lemma shows that ||(if/(:r)|| changes not too fast with H. It will be used to prove 
Theorems 13.101 and 13.141 

Lemma 3.5 For any x G 9ft n , H y 0, and H y 0, let d = dn{x) and d = dg(x). Then 
nj,, . 1 + A max (Q) + y/1 - 2A min (Q) + qgg 

\\ d \\ < — 7^ *mK*{H)\\d\\, (29) 

where Q = H^^HH- 1 ! 2 . 

Proof. The conclusion immediately follows from Lemma 3.2 of [26] with J = {1, . . . , n}, 
c = 1, and P(x) := P(x) + Ix{x), where Ix is the indicator function of X. m 

The following lemma will be used to prove Theorems 13.101 and 13.141 

Lemma 3.6 Given x G 3? n and H y 0, let g = Vf(x) and = g T d + P(x + d) — P(x) for 
all d G 3? n . Le£ a G (0, 1) 6e given. The following statements hold: 

(a) Ifd = dn(x), then 

-A d > d T Hd > \ min (H)\\d\\ 2 . 

(b) For any x G 9ft n ; a G (0, 1], d = d H (x), and x' = x + ad, then 

(g + Hd) T (x' -x) + P(x') - P(x) < (a - l)(d T Hd + A d ). 

(c) If f satisfies 

\\Vf(y)-Vf(z)\\<L\\y-z\\ Vy,zE^ n (30) 
for some L > 0, then the descent condition 

F(x + ad) < F(x) + aaA d 

is satisfied for d = dn(x), provided < a < min{l, 2(1 — a) \ min (H) / L} . 

(d) If f satisfies [3fy) , then the descent condition 

F{x + d)< F{x) + aA d 
is satisfied for d = dH( a )(x), where H(a) = aH, provided a > L/[2(l — cr)A min (i/)] . 
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Proof. The statements (a)-(c) follow from Theorem 4.1 (a) and Lemma 3.4 of [26J with 
J = {1, . . . , n}, 7 = 0, and A = A m i n (if). We now prove statement (d). Let d = dH( a )(x), 
where H(a) = aH for a > 0. It follows from statement (a) that ||<i|| 2 < —Ad/(aX m i n (H)). 
Using this relation, (1301) . and the definitions of F and A d , we have 

F(x + d)-F(x) = f(x + d)- f(x) + P(x + d)-P(x) 

= Vf{x) T d + P{x + d) - P(x) + [ d T {Vf{x + td) -Wf(x))dt 

Jo 

< A d + f 1 \\Vf(x + td)-Vf(x)\\\\d\\dt 

Jo 

< A d + L\\d\\ 2 /2 < [l-L/(2a\ Ui UH))]A d , 

which together with a > L/[2(l — cr)A min (if )], immediately implies that statement (d) holds. 
■ 

We now present the first nonmonotone gradient method for (1261) as follows. 
Nonmonotone gradient method I: 

Choose parameters r] > 1, < a < 1, < a< a, < X< X, and integer M > 0. Set k = 
and choose x° G X . 

1) Choose a£ e [a, a] and XI ^ ^ XL 

2) For j = 0,1,... 

2a) Let = a^rf . Solve f l28|) with x = x k and H = a^Flk to obtain d k 
2b) If d k satisfies 

F(x k + d k ) < max Ffx 1 ') + aA k , 

[k-M] + <i<k 

go to step 3), where 

A k := Vf{x k ) T d k + P{x k + d k ) - P(x k ). (32) 

3) Set x k+1 = x k + d k and k <— k + 1. 
end 

Remark. The above method is closely related to the one proposed in [27]. They differ 
from each other only in that the distinct A^'s are used in fl3T|) . Indeed, the latter method 
uses Afc = — afc||<i' c || 2 /2. Nevertheless, for global convergence, the method [27J needs a strong 
assumption that the function / is Lipschitz continuously differentiable, which is not required 
for our method (see Theorem 13.91) . In addition, our method can be viewed as an extension of 
the first projected gradient method (namely, SPG1) studied in [4J for smooth problems, but 
their method cannot. Finally, local convergence is established for our method (see Theorem 
I3.10p . but not studied for their method. ■ 



= d H (x). 

(31) 
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We next prove global convergence of the nonmonotone gradient method I. Before proceed- 
ing, we establish two technical lemmas below. The first lemma shows that if x k G X is a 
nonstationary point, there exists an a k > in step 2a) so that (13~T1) is satisfied, and hence the 
above method is well defined. 



Lemma 3.7 Suppose that H k y and x k G X is a nonstationary point of problem §2i 
Then, there exists a > such that d k = dH k (a k )(% k ) , where H k (a k ) = ot k H k , satisfies (3l\) 
whenever a k > a. 

Proof. For simplicity of notation, let d(a) = d^ fc ( a )(x fc ), where H k (a) = aH k for any a > 0. 
Then, it follows from ( 1281) that for all a > 0, 



2[Vf(x k ) T d(a) + P{x k + d{a)) - P{x k )} 2F'(x k , d{a)/\\d{a) ||) 
a o(a) < 7 777 xii ,/ rn < r 777T • 33 

Thus, we easily see that the set S := {a||<i(a)|| : a > 0} is bounded. It implies that ||c?(a)|| — > 
as a — > 00. We claim that 

liminfa||d(a)|| > 0. (34) 

a— >oo 

Suppose not. Then there exists a sequence {d[} f 00 such that a^||(i(a/)|| — > as / — ► 00. 
Invoking that d{a{) is the optimal solution of f[2"5j) with 2 = x k , H = aiH k and a = ai, we 
have 

G V/(x fc ) + aiH k d(a{) + dP(x k + d(a,)) + iV x (x fc + d(a,)). 

Upon taking limits on both sides as I —>■ 00, and using semicontinuity of dP(-) and Nx{-) 
(see Theorem 24.4 of [23] and Lemma 2.42 of [21]), and the relations ||d(cty)|| — > and 
a; || || — ► 0, we see that fl27|) holds at which contradicts the nonstationarity of a; fc . 
Hence, ( 1341) holds. We observe that 

ad(a) T H k d(a) > X mhi (H k )a\\d(a) || 2 , 

which together with fl34|) and if^ >- 0, implies that ||d(a)|| = O (ad(a) T H k d{af) as a — > 00. 
In addition, using the boundedness of £ and if^ >- 0, we have ad(a) T H k d(a) = O (||d(a)||) as 
a — > 00. Thus, we obtain that 

ad(a) T H k d(a) = 0(||d(a)||) as a — > 00. (35) 

This relation together with Lemma 13.6( a) implies that 

P(x k ) - Vf{x k ) T d{a) - P{x k + d{a)) > ad{a) T H k d{a) = 0(||d(a)||). (36) 

Using this result, and the relation ||c?(a)|| ->0asa-> 00, we further have 

F(x k + d(a))- max Fix*) < F(x k + d(a)) - F(x k ) 

[k-M]+<i<k 

= f{x k + d(a)) - f{x k ) + P{x k + d(a)) - P{x k ) 

= X7f(x k ) T d(a) + P(x k + d{a)) - P(x k ) + o(||d(a)||) 

< a[Vf{x k ) T d{a) + P{x k + d{a))-P(x k )}, (37) 
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provided a is sufficiently large. It implies that the conclusion holds. ■ 

The following lemma shows that the search directions {d k } approach zero, and the sequence 
of objective function values {F(x k )} also converges. 

Lemma 3.8 Suppose that F is bounded below in X and uniformly continuous in the the level 
set £ = {x e X : F(x) < F(x )}. Then, the sequence {x k } generated by the nonmonotone 
gradient method I satisfies \\m k ^ 00 d k = 0. Moreover, the sequence {F(x k )} converges. 

Proof. We first observe that {x k } C £. Let l{k) be an integer such that [fc— M] + < l(k) < k 
and 

F{x m ) = max{F(x 4 ) : [fc - M} + < i < k} 

for all fc > 0. We clearly observe that F(x k+1 ) < F(x 1 ^) for all fc > 0, which together with 
the definition of /(fc) implies that the sequence {F(x 1 ^)} is monotonically nonincreasing. 
Further, since F is bounded below in X, we have 

lim F(x m ) = F* (38) 

for some F* £ 3?. We next prove by induction that the following limits hold for all j > 1: 

lim d l(k) ~ j = 0, lim F(x l{k) - 3 ) = F* . (39) 

k^oo k— >oo 

Using (|3TT) and ( 1321) with fc replaced by Z(fc) — 1, we obtain that 

F(xW) < F(x 1 ^-^) + aA m ^. (40) 

Replacing fc and a by /(fc) — 1 and ai^)-i i n ([36]) . respectively, and using H^k)-i b an d 
the definition of Auk)-i ( see ( 132|) ). we have 

A, (fc )_! < -\a m ^\\d l ^- l \\ 2 . 

The above two inequalities yield that 

F(x m ) < F(x im - 1] ) - aXat^WS^W 2 , (41) 

which together with (|38|) implies that lim^oo a;(A;)-i||^^ A: ' ) ~ 1 || 2 = 0. Further, noticing that 
ctk>CL for all fc, we obtain that lim^oo d 1 ^^ 1 = 0. Using this result and (|38|) . we have 

lim Fix 1 ^- 1 ) = lim F(x m - d m ~ l ) = lim F{x l{k) ) = F*, (42) 

k-^oo k—*oo k— >oo 

where the second equality follows from uniform continuity of F in £. Therefore, fl39l) holds 
for j = 1. We now need to show that if ( 139]) holds for j, then it also holds for j + 1. Using a 
similar argument as that leading to (EH), we have 

F(x l <V- j ) < FfrWQ-'-V) -«7Aa /(fc )_ i _i||d ,(fc) --'- 1 || 2 , 
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which together with (1381) . the induction assumption lim^oo F(x l ^ k '~ : ') = F*, and the fact 
that aj(fc)_j_i > a for all k, yields lim^oo c?H fe ) - J -1 = 0. Using this result, the induction 
assumption lim^oo F{x l<yk ^'-') = F*, and a similar argument as that leading to ( 1421) . we can 
show that lim fc ^ 00 F(x i(fe) ^'- 1 ) = F*. Hence, (ESJ) holds for j + 1. 

Finally, we will prove that lim^oo d fe = and linn^oo F(x ) = F*. By the definition of 
1(h), we see that for k > M + 1, k — M — 1 = /(&;) — j for some 1 < j < M + 1, which together 
with the first limit in (1391) . implies that lim^ 00 <i' c = lim^oo d k ^ AI ^ 1 = 0. Additionally, we 
observe that 

h 



x Kk) = X k-M 



~ 1 + J2 dm ~ j Vfc>M + l, 



where 1% = l(k) — (k — M — 1) < M + 1. Using the above identity, fl39|) . and uniform continuity 
of F in C, we see that lim^-^ F(x k ) = lim^oo F(x k ^ M ^ 1 ) = F*. Thus, the conclusion of this 
lemma holds. ■ 

We are now ready to show that the nonmonotone gradient method I is globally convergent. 

Theorem 3.9 Suppose that F is bounded below in X and uniformly continuous in the level set 
C = {x G X : F(x) < F(x )}. Then, any accumulation point of the sequence {x k } generated 
by the nonmonotone gradient method I is a stationary point of 

Proof. Suppose for contradiction that x* is an accumulation point of {x k } that is a non- 
stationary point of (1261) . Let K be the subsequence such that {x k } k eK ~ * x * '■ We first claim 
that {ak}keK is bounded. Suppose not. Then there exists a subsequence of {a k } k £K that 
goes to oo. Without loss of generality, we assume that {a k } k( zK —> oo. For simplicity of 
notations, let a k = a k /r], d k (a) = d Hk ^{x k ) forkeK and a > 0, where H k (a) = aH k . 
Since {a k } k& x oo and a° k < a, there exists some index k > such that a k > a° k for all 
k G K with k >k. By the particular choice of a k specified in steps (2a) and (2b), we have 

F{x k + d k {a k ))> max F{x l ) + a[V f{x k ) T d k {a k ) + P(x k + d k {a k )) - P{x k )], (43) 

[k-M]+<i<k 

Using a similar argument as that leading to (1331) . we have 

-\\^f-\\\< 2F'(x k ,d k (a k )/\\d k (a k )\\) 

ak\\d (Qtfc)H < t— r Vfc G A , 

which along with the relations H k y XI and {x k } k€ x — ► x*, implies that {aijfc||d A (d!fc)||}fc e i<: is 
bounded. Since {a k } k£ x — > oo, we further have {Hg^o^)!!}^^ — > . We now claim that 

liminf a k \\d k (a k )\\ > 0. (44) 

k£K,k— *oo 

Suppose not. By passing to a subsequence if necessary, we can assume that {a k \\d k (a k ) || } k ^K — > 
0. Invoking that d k (a k ) is the optimal solution of ( 1281) with x = x k and H = a k H k , we have 

G V/(x fc ) + a k H k d k (a k ) + <9P(x fc + rf fc (o fc )) + N x (x k + d k (a k )) \/k G K. 
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Upon taking limits on both sides as k G K — > oo, and using semicontinuity of dP(-) and 
A0c(0 (see Theorem 24.4 of [23] and Lemma 2.42 of [53]). the relations XI ^ H k ^ XI, 
{\\d k (a k )\\} keK -> 0, {a fe ||d fc (a fc )||} fce x -> and {x fc } fcei ^ -> x*, we see that (E"7J) holds at 
x*, which contradicts nonstationarity of x*.Thus, (14*41 holds. Now, using (14*41) . the relation 
H k b A^j boundedness of {cifc || c? fc (<Sfc) || j-fce^? an d a similar argument as that leading to ( 1351) . 
we observe that a k d k (a k ) T H k d k (a k ) = 0(||<i fc (a:fc)||) as k G if — > oo. Using this result and a 
similar argument as that leading to f[37j) . we have 

F{x k + d k {a k )) < max F{x l ) + a[V/(x fc ) T ^(a/0 + P{x k + d k {a k )) - P(x fc )], 

[k-M] + <i<k 

provided that k E K is sufficiently large. The above inequality evidently contradicts (|43p . 
Thus, {afcjfce^ is bounded. 

Finally, invoking that d k = d k (a k ) is the optimal solution of (f2"g]) with x = x k , H = a k H k , 
we have 

OG V/(x fc ) + a fe ^ fc + 5P(x fe + c/ fc ) + AT x (x fc + c/ fc ) V/c G if. (45) 

By Lemma [3781 we have {c? fc j-fce^ft" — * 0. Upon taking limits on both sides of (j4"5l) as k G if — > oo, 
and using semicontinuity of dP(-) and Nx(-) (see Theorem 24.4 of [23J and Lemma 2.42 of 
[24]). and the relations XI ^ H k ^ XI, {d k } k€K — ► and {x fe } fce ^ — > a;*, we see that (1271) 
holds at x*, which contradicts the nonstationarity of x* that is assumed at the beginning of 
this proof. Therefore, the conclusion of this theorem holds. ■ 

We next analyze the asymptotic convergence rate of the nonmonotone gradient method I 
under the following assumption, which is the same as that made in [26]. In what follows, we 
denote by X the set of stationary points of problem (1261) . 

Assumption 2 (a) X ^ and, for any ( > min^gx P{x), there exists r > and e > 
such that 

dist(x, X) < r\\di(x)\\ whenever F(x) < £, ||dj(a;)|| < e. 
(b) There exists 5 > such that 

\\x — y\\ > 5 whenever x G X, y G X , F(x) ^ F(y). 

We are ready to establish local linear rate of convergence for the nonmonotone gradient 
method I described above. The proof of the following theorem is inspired by the work of Tseng 
and Yun [26], who analyzed a similar local convergence for a coordinate gradient descent 
method for a class of nonsmooth minimization problems. 

Theorem 3.10 Suppose that f satisfies (E2P, and F is bounded below in X and uniformly 
continuous in the level set C = {x G X : F(x) < F(x )}. Then, the sequence {x k } generated 
by the nonmonotone gradient method I satisfies 

F(x m )-F* < c(F(x l( - 1 ^-^ - F*), 

provided k is sufficiently large, where F* = lim^oo F(x k ) (see Lemma \3. 8\) . and c is some 
constant in (0, 1). 
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Proof. Invoking a k < at and the specific choice of we see from Lemma 13.6( d) that 
d := sup fc afc < oo. Let H^{a) = ali k . Then, it follows from XI ^ -< XI and > a that 
(a - X)I z< H k (ak) z< aXI. Using this relation, Lemma |375| H k >z XI, and d k = dH k (a k ){x k ), we 
obtain that 

11^(^)11=6(11^11), (46) 

which together with Lemma 13.81 implies {di(x k )} — ► 0. Thus, for any e > 0, there exists 
some index k such that di(x l ^~ 1 ) < e for all fc > fc. In addition, we clearly observe that 
F(x l ^ k ^ 1 ) < F(x°). Then, by Assumption [2(a) and f|46l) . there exists some index k! such that 

UaJW-i _ T Kk)-i || < Ci ||^(fc)-i || yk > k > ( 47 ) 
for some c\ > and x 1 ^^ 1 E X. Note that 

i(k+i)-2 

||^+i)-i_ x W-i|| < ^ ||^||< ^ ||^|| ; 

i=i(fe)-l i=[jfc-iW-l]+ 

which together with {d k } — > 0, implies that ||x^ A:+1 ^ 1 — x'^^ 1 )! — > 0. Using this result, (I4"7j) . 
and Lemma [3.8[ we obtain 

H^/(fc+i)-i _ ^/(fe)-iii < |i„i(fe+i)-i _ ^/(fe+i)-iii i m i(fc)-i _ ^/(fc)-iii , ii„/(fc+i)-i _ =j(fc)-i|| 

< c 1 ||^ fe+1 )- 1 || +c 1 ||rf'( fc )- 1 || + -x l ^- l \\ -> 0. 

It follows from this relation and Assumption 12(b) that there exists an index fc > fc' and t> G 9? 
such that 

=v Vfc > fc. (48) 

Then, by Lemma 5.1 of [26j, we see that 

F* = lim F{x k ) = liminf F(x i(fc)_1 ) > v. (49) 

fc— >oo fc— >oo 

Further, using the definition of F, ([301) . (jl8|) . Lemma [3761 (b) . and Hk(ak) ^ dA/, we have for 
fc > fc, 

F(x' (fe) )-i; = /(^ fe) ) +P(^ fe) ) -fix 1 ^- 1 ) -P(x^)- 1 ) 

= V /(x fc ) T (x^ - x 1 ^- 1 ) + P(x l W) - P(x 1 ^- 1 ) 

= (V/(* fc ) - Vf(xW- l f(x m ~ s^" 1 ) - (ff I (*)-i(a«(*)-i)d' ( * ) - 1 ) T (^* ) - *' (fc) 

+ ^/(x'W- 1 ) + ff I(Jfc) _ 1 (a w _ 1 )d'«- 1 ) T (x IW - * l(kyi ) + " P (x Kk) - 

< L\\x k -x l ^- l \\\\x 1 ^ -x l ^~ x \\ + aA||d , W- 1 ||||x'W -x'W- 1 !!, 

where x fc is some point lying on the segment joining :r^ fc ) with ac'W- 1 . It follows from (JUj) 
that, for fc > fc, 

Ha* _ || < \\ x m _ ^w-i || + \\ x m-i _ x uk)-i || = (1 + Cl) ||^w-i || . 
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Similarly, \\x l( - k ^ — x 1 ^ x || < (1 + ci)||d z ( fc ) 1 || for k > k. Using these inequalities, Lemma 
13.6( a). H k (a k ) y (a ■ X)I, and (!50|) . we see that for k > k, 

F(x l ^)-v < -c 2 A l{kyi 

for some constant > 0. This inequality together with (1401) gives 

F(x l W)-v < c 3 (F(x im - 1] )-F(x m )) \/k>k, (51) 

where c 3 = C2/0". Using lim^oo F(x 1 ^) = F*, and upon taking limits on both sides of ( 15T1) . 
we see that F* < v, which together with (149p implies that v = F*. Using this result and upon 
rearranging terms of flSTl) . we have 

F( x l{k) ) - F* < c(F(x' (/(fc)) - 1) - F*) Wk > k, 

where c = c 3 / (1 + c 3 ). ■ 

We next present the second nonmonotone gradient method for (1261) as follows. 
Nonmonotone gradient method II: 

Choose parameters < 77 < 1, 0<<7<1, < a < a, 0<A<A, and integer M > 0. Set 
k = and choose x° G X . 

1) Choose XI ^H k ^ XL 

2) Solve (1281) with x = x k and i7 = //fc to obtain d k = d^(x), and compute according 
to dH2}. 

3) Choose a£ G [a, a]. Find the smallest integer j > such that = oi^rf satisfies 

F(x k + a k d k ) < max Fix*) + aa k A k , (52) 

[k-M] + <i<k ' 

where A k is defined in ( 1521 . 

4) Set x k+1 = x k + a k d k and k <— k + 1. 
end 

Remark. The above method is closely related to the one proposed in |26j. When the 
entire coordinate block, that is, J = {1, . . . , n} is chosen for the latter method, it becomes a 
special case of our method with M — 0, which is a gradient descent method. Given that our 
method is generally a nonmonotone method especially when M > 1, most proofs of global 
and local convergence for the method |26j do not hold for our method directly. In addition, 
our method can be viewed as an extension of the second projected gradient method (namely, 
SPG2) studied in [1] for smooth problems, but the method [26] generally cannot. ■ 
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We next prove global convergence of the nonmonotone gradient method II. Before pro- 
ceeding, we establish two technical lemmas below. The first lemma shows that if x k G X is a 
nonstationary point, there exists an a k > in step 3) so that ( l52l) is satisfied, and hence the 
above method is well defined. 

Lemma 3.11 Suppose that H k y and x k G X is a nonstationary point of problem If2h}) . 
Then, there exists a > such that d k = dn k {x k ) satisfies (EIJ) whenever < a k < a. 

Proof. In view of Lemma 2.1 of [26] with J = {1, . . . , n}, c = 1, x = x k , and H = H k , we 
have 

F(x k + ad k ) < F(x k ) + aA k + o{a) 

< max Fix 1 ) + aA k + o(a) VaG (0,1], 

[k-M] + <i<k 

where is defined in (1321) . Using the assumption of this lemma, we see from Lemma 13.41 
that d k 0, which together with H k >- and Lemma [3.6( a) implies < 0. The conclusion 
of this lemma immediately follows from this relation and the above inequality. ■ 

The following lemma shows that the scaled search directions {a k d k } approach zero, and 
the sequence of objective function values {F(x k )} also converges. 

Lemma 3.12 Suppose that F is bounded below in X and uniformly continuous in the level 
set L = {x G X : F(x) < F(x )}. Then, the sequence {x k } generated by the nonmonotone 
gradient method II satisfies lim^oo a k d k = 0. Moreover, the sequence {F(x k )} converges. 

Proof. Let l(k) be defined in the proof of Lemma 13.81 We first observe that {x k } C £. 
Using ( 1321 . the definition of d k , and H k y XI, we have 

A k = Vf(x k ) T d k + P(x k + d k ) - P(x k ) < --(d k ) T H k d k < --A||rf fc || 2 , (53) 

which together with the relation a k < a k < a, implies that 

al\\d k \\ 2 < -2aa k A k /X. (54) 

By a similar argument as that leading to ( 1381) . we see that {x k } satisfies ( 1381) for some F*. 
We next show by induction that the following limits hold for all j > 1: 

lim a l(k) .jd m - j = 0, lim F(x l{k) - ] ) = F*. (55) 

k^oo k — >oo 

Indeed, using (I5^|) with k replaced by l(k) — 1, we obtain that 

F{x 1 ^) < F(x 1 ^^) + aa m _ x A m _ x . 

It together with (138!) immediately yields limfc^oo ai( k )-iA^ k ^i = 0. Using this result and (|54|) . 
we see that the first identity of (|55|) holds for j = 1. Further, in view of this identity, (138|) . 
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and uniform continuity of F in £, we can easily see that the second identity of (1551) also holds 
j = 1. We now need to show that if (155]) holds for j, then it also holds for j + 1. First, it 
follows from (J52l) that 

< FCx'CW-^ 1 )) + < 7a, (fc) _ i _iA I(fc) _ i _ 1 , 

which together with (138]) and the induction assumption that lim^oo F(x l ^ k '~-') = F*, yields 
linifc^oo ati( k )-j-iAi(k)-j-i = 0. Using this result and (jMD, we have lim fe _,oo a^-j.-id^ - ^ -1 = 
0. In view of this identity, uniform continuity of F in C and the induction assumption 
Hindoo F(x l ^~ j ) = F*, we can easily show that lim^oo F(x l ^~ :i ~ l ) = F*. Hence, (1551) 
holds for j + 1. The conclusion of this lemma then follows from fl55|) and a similar argument 
as that in the proof of Lemma 13.121 ■ 

We are now ready to show that the nonmonotone gradient method II is globally convergent. 

Theorem 3.13 Suppose that F is bounded below in X and uniformly continuous in the level 
set C = {x G X : F(x) < F(x )}. Then, any accumulation point of the sequence {x h } 
generated by the nonmonotone gradient method II is a stationary point of If2h]) . 

Proof. Suppose for contradiction that x* is an accumulation point of {x k } that is a non- 
stationary point of ( l26l) . Let K be the subsequence such that {x k }keK — * %*• We first claim 
that liminffcg^fc^oo ||<i fe || > 0. Suppose not. By passing to a subsequence if necessary, we can 
assume that {||<i fe ||}fcgA' ~~ * 0. Invoking that d k is the optimal solution of (1251) with x = x h and 
H = H k , we have 

G Vf{x k ) + H k d k + dP{x k + d k ) + N x (x k + d k ) \fk G K. 

Upon taking limits on both sides as k G K — > oo, and using semicontinuity of dP(-) and Nx{-) 
(see Theorem 24.4 of [23J and Lemma 2.42 of [21]) the relations XI ^ H k ^ AI, {||rf fc ||}fceA -> 
and {x k }k£K x*, we see that ( 1271) holds at x*, which contradicts the nonstationarity of x*. 
Thus, liminffcgAfc^oo \\d k \\ > holds. Further, using a similar argument as that leading to 
(13"3"j) . we have 

m<- 2F ' { f' d * /m) VkeK, 

Amin(-^fc) 

which together with {x k } ke K x*, H k >z A/ and liminf^g^fc^oo IM fc || > 0, implies that 
||d fc || = 6(1) for k G if. Further, using (153]) . we see that limsup fcgii - A fc < 0. Now, it 
follows from Lemma 13 . 121 and the relation liminffcg^^oo \\d k \\ > that {a k } k£ x - > 0. Since 
a k — Q. > 0' there exists some index k > such that < a° and a k < r] for all fc G if with 
k > k. Let a k = a k /rj. Then, {a k }k£K — > and < a k < 1 for all k E K. By the stepsize 
rule used in step (3), we have, for all k G K with k >k, 

F(x k + a k d k ) > max Fix 1 ) + aa k A k , (56) 

[fc-M] + <«<fc 
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On the other hand, in view of the definition of F, (132]) . the relations \\d \\ = @(1) and 
lim sup fcg ^ A k < 0, and the monotonicity of (P(x k + ad k ) — P(x k ))/a, we obtain that, 
for sufficiently large k £ K, 

F(x k + a k d k ) = f(x k + a k d k )+P(x k + a k d k ) 

= f(x k + a k d k ) - f(x k ) + P(x k + a k d k ) - P(x k ) + F(x k ) 

= a k Vf{x k ) T d k + o{a k \\d k \\) + P{x k + a k d k ) - P{x k ) + F{x k ) 

< a k Vf(x k ) T d k + o(a k ) + a k \P(x k + d k )-P(x k )}+ max Fix*) 

[k-M] + <i<k 

= max Fix 1 ) + a k A k + o(a k ) 

[k-M] + <i<k 

< max Fix 1 ) + aa k A k , 

[k-M]+<i<k 

which clearly contradicts ( 1561) . Therefore, the conclusion of this theorem holds. ■ 

We next establish local linear rate of convergence for the nonmonotone gradient method 
II described above. The proof of the following theorem is inspired by the work of Tseng and 
Yun [261. 



Theorem 3.14 Suppose that a < 1, f satisfies (3(a) . and F is bounded below in X and 
uniformly continuous in the level set £ = {x £ X : F(x) < F(x )}. Then, the sequence {x k } 
generated by the nonmonotone gradient method II satisfies 

F(x m )-F* < c(F(x l ^ k ^~^ — F*) 

provided k is sufficiently large, where F* = linn^oo F(x k ) (see Lemma \3.1ty) . and c is some 
constant in (0, 1). 

Proof. Since a k is chosen by the Armijo rule with a k > a > 0, we see from Lemma [3.6( c) 
that mf k a k > 0. It together with Lemma \3. 121 implies that {d k } — ► 0. Further, using Lemma 
E3]and the fact that d k = d Hk (x k ) and XI ^ H k z< XI, we obtain that ||rf/(x fc )|| = e(||d fe ||), 
and hence {di(x k )} — > 0. Then, by a similar argument as that in the proof of Theorem 13.101 
there exist c\ > 0, v £ 5i, and x 1 ^' 1 £ X such that 



x 



< Cl \\d l ^-\ F{x l ^~ l )=v Vk>k, 



where k is some index. Then, by Lemma 5.1 of [26], we see that ([49]) holds for {x k }, and the 
above F* and v. Further, using the definition of F, ( I301) . Lemma ISToT b). and XI ^ H k ^ XI, 
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we have, for k > k, 

F(x m )-v = f(x m ) + P(x m ) - f(x l{k) - 1 ) - P(x l{k) - 1 ) 

= Vf{x k ) T {x m - x 1 ^- 1 ) + P(x 1 ^) - P(x m - 1 ) 

= (V/(x fe ) - V/V^- 1 ) V« - ^ {k) ' 1 ) - (ifyfcj-idW- 1 ) V (fc) - x 1 ^- 1 ) 
+ [{Vf{x 1 ^- 1 ) + Ht^dW-YixW - x^- 1 ) + P(x 1 ^) - P^- 1 )] 

< L\\x k - x m ~ l \\\\x m - x^'H + A||d i(A;)_1 ||||x i(fe) - 

+(«,(*)_! - 1) [(d'W- 1 ) T J ff,( fc) _id , ( fc >- 1 + , (57) 

where x k is some point lying on the segment joining x l W with x m -\ It follows from (ED 
and afc < 1 that, for k > k, 

ll^-x^)- 1 !! < Hx'W-x'^ii + iix'w- 1 -^- 1 !! ^(l + con^w- 1 !!. 

Similarly, ||a;^ fc ) — < (1 + ci)||d^ fc ) -1 || for k > k. Using these inequalities, Lemma 

13.6( a). Hk y XI, oik < 1, and (157jl . we see that, for k > k, 

F(x l M)-v < -c 2 A l(kyi 

for some constant c 2 > 0. The remaining proof follows similarly as that of Theorem 13. 101 ■ 

4 Augmented Lagrangian method for sparse PCA 

In this section we discuss the applicability and implementation details of the augmented 
Lagrangian method proposed in Section [3] for solving sparse PCA ([3]). 



4.1 Applicability of augmented Lagrangian method for ([3]) 

We first observe that problem (J3j) can be reformulated as 

min -Tr(V T t,V) + p • |V| 

S.t. V^YlVj < Ay V^j, 

-tf EV- < Ay V* ^ j, 
V T V = /. 

Clearly, problem fl58l) has the same form as (Q. From Subsection 13.21 we know that the 
sufficient conditions for convergence of our augmented Lagrangian method include: i) a feasible 
point is explicitly given; and ii) Robinson's condition (TTUj) holds at an accumulation point. It 
is easy to observe that any V G 3? nxr consisting of r orthonormal eigenvectors of £ is a feasible 
point of (I58I) . and thus the first condition is trivially satisfied. Given that the accumulation 
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points are not known beforehand, it is hard to check the second condition directly. Instead, we 
may check Robinson's condition at all feasible points of (|58|) . However, due to complication of 
the constraints, we are only able to verify Robinson's condition at a set of feasible points below. 
Before proceeding, we establish a technical lemma as follows that will be used subsequently. 

Lemma 4.1 Let V G 3? nxr be a feasible solution of Given any W\, W 2 G S r , the system 
of 

SV T tV + V T t SV + SD = W 1 , (59) 

SV T V + V T SV = W 2 (60) 

has at least one solution (SV, SD) G 9ft nxr x T> r if one of the following conditions holds: 

a) V T Y,V is diagonal and V^TiVi 7^ V^'T.Vj for all i 7^ j; 

b) V T Tt{I — VV T )Ty is nonsingular. 

Proof. Note that the columns of V consist of r orthonormal eigenvectors. Therefore, there 
exist V G 3fJ nx( "'" r) such that [V V] G 5T xn is an orthogonal matrix. It follows that for 
any SV G 5T xr , there exists 5P G 9? rxr and SP G SR^W such that SV = VSP + VSP. 
Performing such a change of variable for SV, and using the fact that the matrix [V V] is 
orthogonal, we can show that the system of (J59~j) and (1601) is equivalent to 

SP T G + GSP + SP T G + G T SP + SD = W u (61) 

SP T + SP = W 2 , (62) 

where G = V T T,V and G = V T flV. The remaining proof of this lemma reduces to show that 
the system of fl£U) and ([62]) has at least a solution (SP, SP, SD) G 9fT xr x ^-r)xr x V r if one 
of conditions (a) or (b) holds. 

First, we assume that condition (a) holds. Then, G is a diagonal matrix and Ga 7^ Gjj for 
all i 7^ j. It follows that there exists a unique SP* G 3? nxr satisfying 5Pu = (W 2 )u/2 for all i 
and 

SPijGjj + GJP.j = (Wi) y Vz^j, 
5Pij + SP jt = (W 2 ) ij Vi^j. 

Now, let SP* = and SD* = Diag(Wi - GW 2 ). It is easy to verify that (5P*, SP*, SD*) is a 
solution of the system of (15"T1) and (1621 . 

We next assume that condition (b) holds. Given any SP G Jj( n - r ) xr ) there exist SY G 
^(n-rjxr and §%jz sfjrxr guch that qt 6 y = and = SY + GSZ. Peforming such a change 

of variable for SP, we see that f l6Tj) can be rewritten as 

5P T G + G5P + 5Z T G T G + G T GSZ + SD = W x . (63) 

Thus, it suffices to show that the system of (16"2"|) and (1631 has at least a solution (SP, SZ, SD) G 
3fJ rxr x 3ft rxr x P r . Using the definition of G and the fact that the matrix [V V] is orthogonal, 
we see that 

G T G = V T ±VV T ±V = V T ±(I - VV T )±V, 
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which together with condition (b) implies that G T G is nonsingular. Now, let 

5P* = W 2 /2, 5Z* = (G T G)- 1 (2W 1 -W 2 G -GW 2 )/4, 5D* = 0. 

It is easy to verify that (SP*, SZ*, SD*) is a solution of the system of (1631) and fl62|) . Therefore, 
the conclusion holds. ■ 

We are now ready to show that Robinson's condition (TlOT) holds at a set of feasible points 

of (EE}. 

Proposition 4.2 Let V G 9ft nxr fre a feasible solution of 155]) . TTie Robinson's condition UDi) 
holds at V if one of the following conditions hold: 

a) Aij = and VftVi ^ VjtVj for all % ^ j; 

b) There is at least one active and one inactive inequality constraint of ( f5g|) at V , and 
V T Y,(I — VV T )YiV is nonsingular; 

c) All inequality constraints of (58\) are inactive at V. 

Proof. We first suppose that condition (a) holds. Then, it immediately implies that V T T.V 
is diagonal, and hence the condition (a) of Lemma 14.11 holds. In addition, we observe that 
all constraints of fl58l) become equality ones. Using these facts and Lemma 14.11 we see that 
Robinson's condition ffTUl) holds at V. Next, we assume that condition (b) holds. It implies 
that condition (b) of Lemma H~T1 holds. The conclusion then follows directly from Lemma H~T1 
Finally, suppose condition (c) holds. Then, Robinson's condition ( {TO]) holds at V if and only 
if ([60]) has at least a solution 5V G 3fT xr for any W 2 G S r . Noting that V T V = I, we easily 
see that 5V = VW 2 /2 is a solution of (!60|) . and thus Robinson's condition (jTOj) holds at V. ■ 

From Proposition ^. 21 we see that Robinson's condition (|T0|) indeed holds at a set of feasible 
points of fl5E\i . Though we are not able to show that it holds at all feasible points of (1B^|) . 
we observe in our implementation that the accumulation points of our augmented Lagrangian 
method generally satisfy one of the conditions described in Proposition 14.21 and so Robinson's 
condition usually holds at the accumulation points. Moreover, we have never seen that our 
augmented Lagrangian method failed to converge for an instance in our implementation so 
far. 

4.2 Implementation details of augmented Lagrangian method for 

(EHD 

In this section, we show how our augmented Lagrangian method proposed in Subsection 13.21 
can be applied to solve problem fl58l) (or, equivalently, ([3])). In particular, we will discuss the 
implementation details of outer and inner iterations of this method. 
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We first discuss how to efficiently evaluate the function and gradient involved in our aug- 
mented Lagrangian method for problem (|58|) . Suppose that g > is a penalty parameter, 
and {\fj}i^j and {A^}^ are the Lagrangian multipliers for the inequality constraints of ( J58l) . 
respectively, and /i G S r is the Lagrangian multipliers for the equality constraints of ( 1581) . For 
convenience of presentation, let A G S r be the matrix whose r/'th entry equals the parameter 
Ajj of (158]) for all i ^ j and diagonal entries are 0. Similarly, let A + (resp., A") be an r x r 
symmetric matrix whose ijih entry is A^- (resp., A^) for all i ^ j and diagonal entries are 0. 
We now define A G §l 2rxr by stacking A + over A - . Using these notations, we observe that the 
associated Lagrangian function for problem ( 1581) can be rewritten as 

(64) 



L e {V,\,fi) = w{V)+pm \V\ 



where 



w(V) = -Tr(V T TV) + 



2g 



x- 

X' 



S-A 
-S-A 



A H 
A" 



and 



S = V 1 TV - Diag(V J TV) 



R = V 1 V - I. 



(65) 



It is not hard to verify that the gradient of w(V) can be computed according to 

Vw(V) = 2 (-tV (I - [A + + gS - gA} + + [A - - gS - gA} + ) + V(p + gR) 

Clearly, the main effort for the above function and gradient evaluations lies in computing 
V T EV and TV. When E e S p is explicitly given, the computational complexity for evaluating 
these two quantities is 0(p 2 r). In practice, we are, however, typically given the data matrix 
X G 3? nxp . Assuming the column means of X are 0, the sample covariance matrix E can be 
obtained from E = X T Xj (n — 1). Nevertheless, when p > n, we observe that it is not efficient 
to compute and store E. Also, it is much cheaper to compute V T TV and TV by using E 
implicitly rather explicitly. Indeed, we can first evaluate XV, and then compute V T TV and 
TV according to 

V T TV = (XV) T (XV)/(n - 1), TV = X T (XV)/(n - 1). 

Then, the resulting overall computational complexity is 0(npr), which is clearly much superior 
to the one by using E explicitly, that is, 0{p 2 r). 

We now address initialization and termination criterion for our augmented Lagrangian 
method. In particular, we choose initial point V^ it and feasible point V icas to be the loading 
vectors of the r standard PCs, that is, the orthonormal eigenvectors corresponding to r largest 
eigenvalues of E. In addition, we set initial penalty parameter and Lagrangian multipliers to 
be 1, and set the parameters r = 0.2 and a = 10. We terminate our method once the 
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constraint violation and the relative difference between the augmented Lagrangian function 
and the regular objective function are sufficiently small, that is, 

riT/TviT/l A 1+ s ID 1^ \Lq{V, A, /i) — /(V) 

max V, EVj\ - A^] + < e h max i2y < e E , TTJrtKi T\ - e °' ( 66 ) 

hi max {\j{V ) |, 1 J 

where /(V) = — Tr(V T S^) + p • |V|, -R is defined in ( 1651) . and ej, e^, eo are some pre- 
scribed accuracy parameters corresponding to inequality constraints, equality constraints and 
objective function, respectively 

We next discuss how to apply the nonmonotone gradient methods proposed in Section 13.31 
for the augmented Lagrangian subproblems, which are in the form of 

minL e (V,A,/x), (67) 

where the function L g (-,X, p) is defined in (16^1) . Given that the implementation details of 
those nonmonotone gradient methods are similar, we only focus on the second one, that is, 
the nonmonotone gradient method II. First, the initial point for this method can be chosen 
according to the scheme described at the end of Subsection 13.21 In addition, given an iterate 
V k , the search direction d k is computed by solving subproblem fl28|) with H = ct^ 1 /, which 
becomes, in the context of (1211) and 



d k := argmin <^ Vw{V k ) • d+ \\d\\ 2 F + p • \V k + d\ \. (68) 

d y 2a k J 

Here, a k > is chosen according to the scheme proposed by Barzilai and Borwein [2], which 
is also used by Birgin et al. [I] for studying a class of projected gradient methods. Let 
< a min < a max be given. Initially, choose an arbitrary a G [a min , a max ]. Then, a k is 
updated as follows: 



"max, if h < 0; 

max{a min , min{a max , a k /b k }}, otherwise, 



where a k = \\V k - V k ~ l \\ 2 F and b k = (V k - V k ~ l ) • (Ww(V k ) - Vw(y fc " 1 )). It is not hard to 
verify that the optimal solution of problem (|68p has a closed-form expression, which is given 
by 

d fc = sign(C)0 [\C\-a k p} + -V k , 

where C = V k — a Viu (V ). In addition, we see from Lemma 13^1 that the following termination 
criterion is suitable for this method when applied to ( 1671) : 

maxy \di(V k )\ij 



max(\L ( ,(V k ,X,p)\ } l) 



where di(V k ) is the solution of (l6"8"j) with a k = 1, and e is a prescribed accuracy parameter. 
In our numerical implementation, we set a = 1/ max^ \di(V°)\ij, a max = 1, a min = 1CT 15 and 
e = 10" 4 . 
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Table 1: Sparse PCA methods used for our comparison 



GPower; 1 


Single-unit sparse PCA via Zi-penalty 


GPower; 


Single-unit sparse PCA via Zo-penalty 


GPower; , m 


Block sparse PCA via Zi-penalty 


GPower; 


Block sparse PCA via Zo-penalty 


DSPCA 


DSPCA algorithm 


SPCA 


SPCA algorithm 


rSVD 


sPCA-rSVD algorithm with soft thresholding 


ALSPCA 


Augmented Lagrangian algorithm 



Finally, it shall be mentioned that for the sake of practical performance, the numerical 
implementation of our augmented Lagrangian method is slightly different from the one de- 
scribed in Subsection 13.21 In particular, we follow a similar scheme as discussed on pp. 405 
of [3] to adjust penalty parameter and Lagrangian multipliers. Indeed, they are updated sep- 
arately rather than simultaneously. Roughly speaking, given 7 e (0, 1), we adjust penalty 
parameter only when the constraint violation is not decreased by a factor 7 over the previous 
minimization. Similarly, we update Lagrangian multipliers only when the constraint viola- 
tion is decreased by a factor 7 over the previous minimization. We choose 7 = 0.25 in our 
implementation as recommended in [3]. 

5 Numerical results 

In this section, we conduct numerical experiments for the augmented Lagrangian method 
detailed in Subsections 13.21 and 14.21 for formulation (158]) (or, equivalently, (j3J)) of sparse PCA 
on synthetic, random, and real data. In particular, we compare the results of our approach 
with several existing sparse PCA methods in terms of total explained variance, correlation 
of PCs, and orthogonality of loading vectors, which include the generalized power methods 
(Journee et al. [16j), the DSPCA algorithm (d'Aspremont et al. [8]), the SPCA algorithm 
(Zou et al. [25]). and the sPCA-rSVD algorithm (Shen and Huang [25] )• We now list all the 
methods used in this section in Table [TJ Specifically, the methods with the prefix 'GPower' are 
the generalized power methods studied in [16j, and the method ALSPCA is the augmented 
Lagrangian method proposed in this paper. 

As discussed in Section [2] the PCs obtained from the standard PCA based on sample 
covariance matrix £ e $l nxp are nearly uncorrelated when the sample size is sufficiently large, 
and the total explained variance by the first r PCs approximately equals the sum of the 
individual variances of PCs, that is, Tt(V t T 1 V), where V G $Z pxr consists of the loading 
vectors of these PCs. However, the PCs found by sparse PCA methods may be correlated 
with each other, and thus the quantity Tr(V T tlV) can overestimate much the total explained 
variance by the PCs due to the overlap among their individual variances. In attempt to deal 
with such an overlap, two adjusted total explained variances were proposed in [28J [25]. It 
is not hard to observe that they can be viewed as the total explained variance of a set of 
transformed variables from the estimated sparse PCs. Given that these transformed variables 
can be dramatically different from those sparse PCs, their total explained variances may 
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differ much from each other as well. To alleviate this drawback while taking into account the 
possible correlations among PCs, we introduce the following adjusted total explained variance 
for sparse PCs: 

AdjVarV = Tr(V T ±V) - /^(Vf EV^-) 2 - 

v 

Clearly, when the PCs are uncorrelated, it becomes the usual total explained variance, that 
is, Tt(V t HV). We also define the cumulative percentage of adjusted variance (CPAV) for the 
first r sparse PCs as the quotient of the adjusted total explained variance of these PCs and 
the total explained variance by all standard PCs, that is, AdjVarV A /Tr(S). 

Finally, we shall stress that the sole purpose of this section is to compare the performance 
of those methods listed in Table [I] for finding the sparse PCs that nearly enjoy the three 
important properties possessed by the standard PCA (see Section [Q. Therefore, we will not 
compare the speed of these methods. Nevertheless, it shall be mentioned that our method, 
that is, ALSPCA, is a first-order method and capable of solving large-scale problems within 
a reasonable amount of time as observed in our experiments. 

5.1 Synthetic data 

In this subsection we use the synthetic data introduced by Zou et al. [28] to test the effec- 
tiveness of our approach ALSPCA for finding sparse PCs. 
The synthetic example |28j considers three hidden factors: 

V x ~ N (0,290), V 2 ~ N (0,300), V 3 = -0.3K + 0.9251/ 2 + e, e~JV(0,l), 

where V\, V 2 and e are independent. Then the 10 observable variables are generated as follows: 

X l = V 1 + e] } eJ~iV(0,l), i = 1,2,3,4, 

X i = V 2 + el e 2 ~iV(0,l), 2 = 5,6,7,8, 

X t = V 3 + el e, 3 ~iV(0,l), i = 9, 10, 

where e\ are independent for j = 1, 2, 3 and % = 1, . . . , 10. We will use the actual covariance 
matrix of (X 1; . . . , X 10 ) to find the standard and sparse PCs, respectively. 

We first see that V± and V 2 are independent, but V3 is a linear combination of V\ and V 2 . 
Moreover, the variances of the three underlying factors V\, V 2 and V 3 are 290, 300, and 283.8, 
respectively. Therefore V 2 is slightly more important than V\, and they both are much more 
important than V3. In addition, the first two standard PCs together explain 99.72% of the 
total variance (see Table [2]). These facts suggest that the first two sparse PCs be sufficient 
to explain most of the variance. Ideally, the first sparse PC recovers the factor V 2 only using 
(X 5 , X 6 , X 7 , X 8 ), and the second sparse PC recovers the factor V\ only using (X 1; X 2 , X 3 , X 4 ). 
Moreover, given that (X 5 , X 6 , X 7 , X s ) and (Xi, X 2 , X 3 , X 4 ) are independent, these sparse PCs 
would be uncorrelated and orthogonal each other. 



33 



Table 2: Loadings of the first two PCs by standard PCA and ALSPCA 



Variable 


PCA 


ALSPCA 




PCI 


PC2 


PCI 


PC2 


Xi 


0.1158 


0.4785 





0.5000 


x 2 


0.1158 


0.4785 





0.5000 


x 3 


0.1158 


0.4785 





0.5000 


x 4 


0.1158 


0.4785 





0.5000 


x 5 


-0.3955 


0.1449 


-0.5000 





x 6 


-0.3955 


0.1449 


-0.5000 





X r 


-0.3955 


0.1449 


-0.5000 





X S 


-0.3955 


0.1449 


-0.5000 





x 9 


-0.4005 


-0.0095 








Xio 


-0.4005 


-0.0095 








CPAV (%) 


99.72 


80.46 



Synthetic data 



In our test, we set r = 2, A^- = for all i ^ j, and p = 4 for formulation (1581) of sparse 
PCA. In addition, we choose (1661) as the termination criterion for ALSPCA with ti = eo = 0.1 
and €e = 10~ 3 . The results of standard PCA and ALSPCA for this example are presented 
in Table [3 The loadings of standard and sparse PCs are given in columns two and three, 
respectively, and their CPAVs are given in the last row. We clearly see that our sparse PCs 
are consistent with the ones predicted above. Interestingly, they are identical with the ones 
obtained by SPCA and DSPCA reported in [28l [8]. For general data, however, these methods 
may perform quite differently (see Subsection 15.31) . 

5.2 Random data 

In this subsection, we compare the performance of the GPower methods [16] and our ALSPCA 
method for finding sparse PCs on a set of randomly generated data. First, we randomly 
generate 100 centered data matrices X with the size of 20 x 20. Throughout this subsection, 
we set Ajj = 0.5 for all i ^ j for formulation (jSHD of sparse PCA, and choose (16"61) as the 
termination criterion for ALSPCA with ej = 0.1, = 0.1 and eo = 0.1. 

In the first test, we aim to find the first three sparse PCs with the average number of 
zero loadings around 30 (50% sparsity). To meet this purpose, the tunning parameter p for 
problem f[5^|) and the parameters for the GPower methods are properly chosen. The results of 
the GPower methods and ALSPCA for the above randomly generated instances are presented 
in Table [3j The name of each method is given in column one. The sparsity measured by the 
number of zero loadings averaged over all instances is given in column two. The column three 
gives the average amount of non-orthogonality of the loading vectors, which is measured by 
the maximum absolute difference between 90° and the angles formed by all pairs of loading 
vectors. Clearly, the smaller value in this column implies the better orthogonality. In addition, 
the maximum correlation and CPAV of sparse PCs averaged over all instances are presented 
in columns four and five, respectively. From Table [31 we see that the average number of zero 
loadings of the first three sparse PCs for all methods are almost same, which are around 30. We 
also observe that the sparse PCs given by our method ALSPCA are almost uncorrelated and 
their loading vectors are nearly orthogonal, which are much superior to the GPower methods. 
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Table 3: Comparison of GPower and ALSPCA 



Method 


Sparsity 


Non-orthogonality 


Correlation 


CPAV (%) 


GPower^ 
GPower i() 


30.73 


3.480 


0.104 


39.06 


30.37 


3.540 


0.111 


38.74 


GPower ilim 


30.61 


5.220 


0.161 


38.12 


GPower (0jm 


30.51 


5.216 


0.153 


37.99 


ALSPCA 


30.91 


1.007 


0.038 


40.79 



Random data: Test I 



Table 4: Comparison of GPower and ALSPCA 



Method 


Sparsity 


Non-orthogonality 


Correlation 


CPAV (%) 


GPower^ 


60.58 


5.766 


0.168 


63.46 


GPower; Q 


60.53 


6.031 


0.169 


63.27 


GPower il>m 


60.02 


8.665 


0.265 


62.34 


GPower io>m 


60.14 


9.146 


0.272 


62.00 


ALSPCA 


60.74 


1.177 


0.066 


64.62 



Random data: Test II 



Moreover, our sparse PCs have better CPAV than the others. 

Our aim of the second test is to find the first six sparse PCs with the average number of zero 
loadings around 60 (50% sparsity). To reach this goal, the tunning parameter p for problem 
fl58|) and the parameters for the GPower methods are appropriately chosen. The results of 
the GPower methods and ALSPCA for the above randomly generated instances are presented 
in Table |U Each column of Table H] has the same meaning as Table First, we clearly see 
that our method substantially outperforms the GPower methods in terms of uncorrelation of 
PCs, orthogonality of loading vectors and CPAV. Further, in comparison with Table El we 
observe that as the number of PCs increases, the CPAV grows accordingly for all methods, 
and moreover, the sparse PCs given by the GPower methods become much more correlated 
and non-orthogonal each other. But the performance of our sparse PCs almost remains same. 
This phenomenon is actually not surprising, given that uncorrelation and orthogonality are 
not well taken into account in the GPower methods. 

5.3 Pitprops data 

In this subsection we test the performance of our approach ALSPCA for finding sparse PCs 
on the real data, that is, Pitprops data. We also compare the results with several existing 
methods [2H El [251 IS] • 

The Pitprops data introduced by Jeffers [H] has 180 observations and 13 measured vari- 
ables. It is a classic example that illustrates the difficulty of interpreting PCs. Recently, 
several sparse PCA methods [T7J [2H1 E] have been applied to this data set for finding six 
sparse PCs by using the actual covariance matrix. For ease of comparison, we present the 
standard PCs, and some of those sparse PCs in Tables [5JE1 respectively. It shall be mentioned 
that two groups of sparse PCs were found by DSPCA with the parameter k\ = 5 or 6, and they 
have similar sparsity and total explained variance (see [5] for details). Thus, we only present 
the latter one (i.e., the one with ki = 6) in Table El Also, we applied the GPower methods 
[16] to this data for finding the PCs with the largest sparsity of the ones given in [281 12SI E], 
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Table 5: Loadings of the first six PCs by standard PCA 



Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


0.4038 


0.2178 


0.2073 


0.0912 


0.0826 


0.1198 


length 


0.4055 


0.1861 


0.2350 


0.1027 


0.1128 


0.1629 


moist 


0.1244 


0.5406 


-0.1415 


-0.0784 


-0.3498 


-0.2759 


testsg 


0.1732 


0.4556 


-0.3524 


-0.0548 


-0.3558 


-0.0540 


ovensg 


0.0572 


-0.1701 


-0.4812 


-0.0491 


-0.1761 


0.6256 


ringtop 


0.2844 


-0.0142 


-0.4753 


0.0635 


0.3158 


0.0523 


ringbut 


0.3998 


-0.1897 


-0.2531 


0.0650 


0.2151 


0.0026 


bowmax 


0.2936 


-0.1892 


0.2431 


-0.2856 


-0.1853 


-0.0551 


bowdist 


0.3566 


0.0171 


0.2076 


-0.0967 


0.1061 


0.0342 


whorls 


0.3789 


-0.2485 


0.1188 


0.2050 


-0.1564 


-0.1731 


clear 


-0.0111 


0.2053 


0.0704 


-0.8036 


0.3430 


0.1753 


knots 


-0.1151 


0.3432 


-0.0920 


0.3008 


0.6003 


-0.1698 


diaknot 


-0.1125 


0.3085 


0.3261 


0.3034 


-0.0799 


0.6263 



Pitprops data 



Table 6: Loadings of the first six PCs by SPCA 



Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


-0.477 

















length 


-0.476 

















moist 





0.785 














testsg 





0.620 














ovensg 


0.177 





0.640 











ringtop 








0.589 











ringbut 


-0.250 





0.492 











bowmax 


-0.344 


-0.021 














bowdist 


-0.416 

















whorls 


-0.400 

















clear 











-1 








knots 





0.013 








-1 





diaknot 








-0.015 








1 



Pitprops data 



and found the best result was given by GPower^ . Thus, we only report the six sparse PCs ob- 
tained by GPower/ 1 in Table El In addition, we present sparsity, CPAV, non-orthogonality and 
correlation of the PCs obtained by the standard PCA and sparse PCA methods [28l |25| |8~1 IT6] 
in columns two to five of Table [13], respectively. In more details, the second and fifth columns 
respectively give sparsity (measured by the number of zero loadings) and CPAV. The third 
column reports non-orthogonality, which is measured by the maximum absolute difference be- 
tween 90° and the angles formed by all pairs of loading vectors. The fourth column presents 
the maximum correlation of PCs. Though the PCs obtained by these sparse PCA methods 
have nice sparsity, we observe from Tables [13] that they are much correlated, and moreover 
almost all of them are far from orthogonal each other except the ones given by SPCA [28]. To 
improve the quality of sparse PCs, we next apply our approach ALSPCA, and compare the 
results with these methods. For all tests below, we choose ([661) as the termination criterion 
for ALSPCA with e = 0.1 and e } = e E = 10" 3 . 

In the first experiment, we aim to find six uncorrelated and orthogonal sparse PCs by 
ALSPCA while explaining most of variance. In particular, we set r = 6, Ay = 0.07 for all 
i 7^ j and p = 0.8 for formulation flo'Bl of sparse PCA. The resulting sparse PCs are presented 
in Table [TD1 and their sparsity, CPAV, non-orthogonality and correlation are reported in row 
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Table 7: Loadings of the first six PCs by rSVD 





Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 






topdiam 


-0.449 








-0.114 












length 




-0.460 








-0.102 












moist 







-0.707 


















testsg 







-0.707 


















ovensg 










0.550 








-0.744 






ringtop 




n 1 qq 
-u. -Lyy 





0.546 


-U. 1 1 u 


n 
u 


n 
u 






ringbut 




-0.399 





0.366 















bowmax 


-0.279 








0.422 












bowdist 




-0.380 








0.283 












whorls 




-0.407 











0.231 









clear 













-0.785 


-0.973 









knots 













-0.265 





0.161 






diaknot 










-0.515 








-0.648 












Pitprops data 












Tab] 


c 


8: Loadings of the first 


six PCs by DSPCA 




Variable 




PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 




0.4907 

















length 




0.5067 

















moist 







0.7071 














testsg 







0.7071 














ovensg 
















-1.0000 





ringtop 




0.0670 





-0.8731 











ringbut 




0.3566 





-0.4841 











bowmax 




0.2335 

















bowdist 




0.3861 

















whorls 




0.4089 

















clear 



















1.0000 


knots 













1.0000 








diaknot 










0.0569 












Pitprops data 



seven of Table O We easily observe that our method ALSPCA overally outperforms the other 
sparse PCA methods substantially in all aspects except sparsity. Naturally, we can improve 
the sparsity by increasing the values of p, yet the total explained variance may be sacrificed 
as demonstrated in our next experiment. 

We now attempt to find six PCs with similar correlation and orthogonality but higher 
sparsity than those given in the above experiment. For this purpose, we set A^- = 0.07 for all 
i 7^ j and choose p = 2.1 for problem (T3ST) in this experiment. The resulting sparse PCs are 
presented in Table [TTJ The CPAV, non-orthogonality and correlation of these PCs are given 
in row eight of Table [T3l In comparison with the ones given in the above experiment, the PCs 
obtained in this experiment are much more sparse while retaining almost same correlation and 
orthogonality. However, their CPAV goes down dramatically. Combining the results of these 
two experiments, we deduce that for the Pitprops data, it seems not possible to extract six 
highly sparse (e.g., around 60 zero loadings), nearly orthogonal and uncorrelated PCs while 
explaining most of variance as they may not exist. The following experiment further sustains 
such a deduction. 

Finally we are interested in exploring how the correlation controlling parameters Ay (i ^ j) 
affect the performance of the sparse PCs. In particular, we set A^- = 0.5 for all i ^ j and 
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Table 9: Loadings of the first six PCs by GPower^ 



Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


-0.4182 

















length 


-0.4205 

















moist 





-0.7472 














testsg 


-0.1713 


-0.6646 














ovensg 














-0.7877 





ringtop 


n OQ/I Q 
-U.zo4o 








U 


n fti fin 


U 


ringbut 


-0.4039 

















bowmax 


-0.3002 

















bowdist 


-0.3677 

















whorls 


-0.3868 

















clear 

















1.0000 


knots 











1.0000 








diaknot 








1.0000 















Pitprops data 








Table 10: Loadings of the first six PCs by ALSPCA 


Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


0.4394 

















length 


0.4617 

















moist 


0.0419 


0.4611 


-0.1644 


0.0688 


-0.3127 





testsg 


0.1058 


0.7902 














ovensg 


0.0058 

















ringtop 


0.1302 





0.2094 








0.9999 


ringbut 


0.3477 





0.0515 





0.3240 





bowmax 


0.2256 


-0.3566 














bowdist 


0.4063 

















whorls 


0.4606 














-0.0125 


clear 





0.0369 





-0.9973 








knots 


-0.1115 


0.1614 


-0.0762 


0.0239 


0.8929 





diaknot 


-0.0487 


0.0918 


0.9595 


0.0137 









Pitprops data: Test I 



choose p = 0.7 for problem fl58|) . The obtained sparse PCs are presented in Table [121 The 
CPAV, non-orthogonality and correlation of these PCs are given in the last row of Table [13j 
We see that these PCs are highly sparse, orthogonal, and explain good amount of variance. 
However, they are quite correlated with each other, which is actually not surprising, given 
that Aij(i ^ j) are not small. Despite such a drawback, we observe that these sparse PCs 
still overally outperform those obtained by SPCA, rSVD, DSPCA and GPower^. 

From the above experiments, we may conclude that for the Pitprops data, there do not 
exist six highly sparse, nearly orthogonal and uncorrelated PCs while explaining most of 
variance. Therefore, the most acceptable sparse PCs seem to be the ones given in Table [lOj 

6 Concluding remarks 

In this paper we proposed a new formulation of sparse PCA for finding sparse and nearly 
uncorrelated principal components (PCs) with orthogonal loading vectors while explaining 
as much of the total variance as possible. We also developed a novel globally convergent 
augmented Lagrangian method for solving a class of nonsmooth constrained optimization 
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Table 11: Loadings of the first six PCs by ALSPCA 



Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


1.0000 

















length 





-0.2916 


-0.1421 








-0.0599 


moist 





0.9565 


-0.0433 








-0.0183 


testsg 











0.0786 


-0.1330 





ovensg 








-0.9683 











ringtop 




















ringbut 








0.1949 





0.2369 





bowmax 




















bowdist 




















whorls 




















clear 











-0.9969 








knots 








-0.0480 


0.0109 


0.9624 





diaknot 








-0.0093 








0.9980 



Pitprops data: Test II 



Table 12: Loadings of the first six PCs by ALSPCA 



Variable 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


topdiam 


0.4051 

















length 


0.4248 

















moist 





0.7262 














testsg 


0.0018 


0.6875 














ovensg 








-1.0000 











ringtop 


0.1856 

















ringbut 


0.4123 

















bowmax 


0.3278 

















bowdist 


0.3830 

















whorls 


0.4437 


-0.0028 














clear 











-1.0000 








knots 














1.0000 





diaknot 

















1.0000 



Pitprops data: Test III 



problems, which is well suited for our formulation of sparse PCA. Additionally, we proposed 
two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, and 
established their global and local convergence. Finally, we compared our sparse PCA approach 
with several existing methods on synthetic, random, and real data, respectively. The com- 
putational results demonstrate that the sparse PCs produced by our approach substantially 
outperform those by other methods in terms of total explained variance, correlation of PCs, 
and orthogonality of loading vectors. 

As observed in our experiments, formulation ([3]) is very effective in finding the desired 
sparse PCs. However, there remains a natural theoretical question about it. Given a set of 
random variables, suppose there exist sparse and uncorrelated PCs with orthogonal loading 
vectors while explaining most of variance of the variables. In other words, their actual covari- 
ance matrix £ has few dominant eigenvalues and the associated orthonormal eigenvectors are 
sparse. Since in practice £ is typically unknown and only approximated by a sample covari- 
ance matrix £, one natural question is whether or not there exist some suitable parameters 
p and Ajj {i ^ j) so that ([3]) is able to recover those sparse PCs almost surely as the sample 
size becomes sufficiently large. 

In Section @] we showed that Robinson's condition ffTUl) holds at a set of feasible points of 
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Table 13: Comparison of SPCA, rSVD, DSPCA, GPower;, and ALSPCA 



Method 


Sparsity 


Non-orthogonality 


Correlation 


CPAV (%) 


PCA 











87.00 


SPCA 


60 


0.86 


0.395 


66.21 


rSVD 


53 


14.76 


0.459 


67.04 


DSPCA 


63 


13.63 


0.573 


60.97 


GPower^ 


63 


10.09 


0.353 


64.15 


ALSPCA-1 


46 


0.03 


0.082 


69.55 


ALSPCA-2 


60 


0.03 


0.084 


39.42 


ALSPCA-3 


63 


0.00 


0.222 


65.97 



Pitprops data 



(j58p . Also, we observed from our experiments that the accumulation points of our augmented 
Lagrangian method lie in this set when applied to fl58|) . and thus it converges. However, it 
remains open whether or not Robinson's condition holds at all feasible points of fl58l) . 

In addition, Burer and Monteiro [3] recently applied the classical augmented Lagrangian 
method to a nonconvex nonlinear program (NLP) reformulation of semidefinite programs 
(SDP), and they obtained some nice computational results especially for the SDP relaxations 
of several hard combinatorial optimization problems. However, the classical augmented La- 
grangian method generally cannot guarantee converging to a feasible point when applied to a 
nonconvex NLP. Due this and [19] , at least theoretically, their approach [5] may not converge 
to a feasible point of the primal SDP. Given that the augmented Lagrangian method proposed 
in this paper converges globally under some mild assumptions, it would be interesting to apply 
it to the NLP reformulation of SDP and compare the performance with the approach studied 
in [5]. 

Finally, the codes of our approach for solving the sparse PCA formulation (T35T) (or, equiva- 
lently, ([3])) are written in Matlab, which are available online at www.math.sfu.ca/~zhaosong. 
As a future research, we will further improve their performance by conducting more extensive 
computational experiments and exploring more practical applications. 
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